(54) Microprocessor with non-aligned circular addressing



(57) A data processing system (1300) is provided with a digital signal processor (DSP) (1301) that has an
instruction set architecture (ISA) that is optimized for intensive numeric algorithm processing. The DSP has
dual load/store units (.D1, .D2) connected to dual memory ports (T1, T2) in a level one data cache memory
controller (1720a). The DSP can execute two aligned data transfers, each having a length of one byte, two bytes,
four bytes, or eight bytes, in parallel by executing two load/store instructions. The DSP can also execute a
single non-aligned data transfer having a length of four bytes or eight bytes by executing a non-aligned
load/store instruction that utilizes both memory target ports.
Description

TECHNICAL FIELD OF THE INVENTION 

[0001] This invention relates to data processing devices, electronic processing and control systems and methods of
their manufacture and operation, and particularly relates to memory access schemes of microprocessors optimized
for digital signal processing.

DESCRIPTION OF THE BACKGROUND ART

[0002] Generally, a microprocessor is a circuit that combines the instruction-handling, arithmetic, and logical
operations of a computer on a single semiconductor integrated circuit. Microprocessors can be grouped into two general
classes, namely general-purpose microprocessors and special-purpose microprocessors. General-purpose
microprocessors are designed to be programmable by the user to perform any of a wide range of tasks, and are therefore
often used as the central processing unit (CPU) in equipment such as personal computers. Special-purpose
microprocessors, in contrast, are designed to provide performance improvement for specific predetermined arithmetic and
logical functions for which the user intends to use the microprocessor. By knowing the primary function of the
microprocessor, the designer can structure the microprocessor architecture in such a manner that the performance of the
specific function by the special-purpose microprocessor greatly exceeds the performance of the same function by a
general-purpose microprocessor regardless of the program implemented by the user.

[0003] One such function that can be performed by a special-purpose microprocessor at a greatly improved rate is
digital signal processing. Digital signal processing generally involves the representation, transmission, and
manipulation of signals, using numerical techniques and a type of special-purpose microprocessor known as a digital signal
processor (DSP). Digital signal processing typically requires the manipulation of large volumes of data, and a digital
signal processor is optimized to efficiently perform the intensive computation and memory access operations associated
with this data manipulation. For example, computations for performing Fast Fourier Transforms (FFTs) and for
implementing digital filters consist to a large degree of repetitive operations such as multiply-and-add and
multiple-bit-shift. DSPs can be specifically adapted for these repetitive functions, and provide a substantial performance
improvement over general-purpose microprocessors in, for example, real-time applications such as image and speech
processing.
[0004] DSPs are central to the operation of many of today's electronic products, such as high-speed modems,
high-density disk drives, digital cellular phones, complex automotive systems, and video-conferencing equipment. DSPs
will enable a wide variety of other digital systems in the future, such as video-phones, network processing, natural
speech interfaces, and ultra-high speed modems. The demands placed upon DSPs in these and other applications
continue to grow as consumers seek increased performance from their digital products, and as the convergence of
the communications, computer and consumer industries creates completely new digital products.
[0005] Microprocessor designers have increasingly endeavored to exploit parallelism to improve performance. One
parallel architecture that has found application in some modern microprocessors utilizes multiple instruction fetch
packets and multiple instruction execution packets with multiple functional units, referred to as a Very Long
Instruction Word (VLIW) architecture.

[0006] Digital systems designed on a single integrated circuit are referred to as an application specific integrated
circuit (ASIC). MegaModules are being used in the design of ASICs to create complex digital systems on a single chip.
(MegaModule is a trademark of Texas Instruments Incorporated.) Types of MegaModules include SRAMs, FIFOs,
register files, RAMs, ROMs, universal asynchronous receiver-transmitters (UARTs), programmable logic arrays and
other such logic circuits. MegaModules are usually defined as integrated circuit modules of at least 500 gates in
complexity and having a complex ASIC macro function. These MegaModules are predesigned and stored in an ASIC design
library. The MegaModules can then be selected by a designer and placed within a certain area on a new IC chip.
[0007] Designers have succeeded in increasing the performance of DSPs, and microprocessors in general, by
increasing clock speeds, by removing data processing bottlenecks in circuit architecture, by incorporating multiple
execution units on a single processor circuit, and by developing optimizing compilers that schedule operations to be
executed by the processor in an efficient manner. For example, non-aligned data access is provided on certain
microprocessors. Complex instruction set computer (CISC) architectures (Intel, Motorola 68K) have thorough support for
non-aligned data accesses; however, reduced instruction set computer (RISC) architectures usually do not support
non-aligned accesses with a single load or store instruction. Some RISC architectures allow two data accesses per cycle,
but they allow only two aligned accesses. Certain CISC machines now allow doing two memory accesses per cycle
as two non-aligned accesses. A reason for this is that the dual access implementations are superscalar implementations
that are running code compatible with earlier scalar implementations.

[0008] The increasing demands of technology and the marketplace make desirable even further structural and
process improvements in processing devices, application systems and methods of operation and manufacture.
SUMMARY OF THE INVENTION

[0009] An exemplary embodiment of the present invention seeks to provide a microprocessor and a method for
accessing memory by a microprocessor that improves digital signal processing performance. Aspects of the invention
are specified in the claims.

[0010] In an embodiment of the present invention, each .D unit of a DSP can load and store data items up to double
words (up to 64 bits) at aligned addresses in a two port memory subsystem. The .D units can also access words and
double words on any byte boundary, whether aligned or not. The address generation circuitry in the first .D unit has a
first address output connected to the first memory port and a second address output selectively connected to the
second memory port. The address generation circuitry can provide two addresses simultaneously to request two aligned
data items. An extraction circuit is connected to the memory subsystem to provide a non-aligned data item extracted
from two adjacent aligned data items requested by the .D unit.

[0011] In another embodiment of the present invention, the address generation circuitry in the .D units is operable
to form an address for non-aligned double word instructions by combining a base address value and an offset value.

[0012] In another embodiment of the invention, one or more additional .D units have similar addressing circuitry for
non-aligned accesses.

[0013] In another embodiment of the present invention, two .D units can simultaneously access aligned data items 
in the memory. 

[0014] In another embodiment of the present invention, the memory subsystem has a plurality of memory banks
connected to the extraction circuitry. Decode circuitry is connected to the first memory port and to the second memory
port for receiving addresses. A plurality of address multiplexers are connected respectively to the plurality of memory
banks such that the decode circuitry is operable to individually control each of the plurality of address multiplexers.

BRIEF DESCRIPTION OF THE DRAWINGS 


[0015] Other features and advantages of the present invention will become apparent by reference to the following 
detailed description when considered in conjunction with the accompanying drawings in which the Figures relate to 
the processor of Figure 1 unless otherwise stated, and in which: 

Figure 1 is a block diagram of a digital system with a digital signal processor (DSP), showing components thereof
pertinent to an embodiment of the present invention;

Figure 2 is a block diagram of the functional units, data paths and register files of the DSP;

Figure 3A illustrates an opcode map for the load/store instructions which are executed in a .D unit of the DSP;

Figure 3B illustrates an opcode map for the load/store non-aligned double word instructions which are executed in
a .D unit of the DSP;

Figure 3C illustrates an opcode map for Boolean instructions executed by the .D units;

Figure 4 illustrates an addressing mode register (AMR) of the DSP;

Figures 5A, 5B and 5C illustrate aspects of non-aligned address formation and non-aligned data extraction from
a circular buffer region, according to an aspect of the present invention;

Figure 6 illustrates the basic format of a fetch packet of the DSP;

Figure 7 is a memory map of a portion of the memory space of the DSP and illustrates various aligned and
non-aligned memory accesses;

Figure 8 is a block diagram illustrating D-unit address buses of the DSP in more detail and illustrating the two ports
of the DSP memory;

Figure 9 is a block diagram of the memory of Figure 8 illustrating address decoding of the two address ports and
byte selection circuitry to extract a non-aligned data item according to an embodiment of the present invention;

Figure 10 is a block diagram illustrating the extraction circuitry of Figure 9 in more detail;

Figure 11 is a block diagram illustrating the store byte selection circuitry for storing non-aligned data items in the
memory system of Figure 8 in more detail;

Figure 12A is a more detailed block diagram of the D-unit of the DSP;

Figure 12B is a more detailed block diagram of the circular buffer circuitry of Figure 12A;

Figure 12C is a flow chart illustrating formation of circular buffer addresses for both aligned access instruction
types and non-aligned access instruction types, according to an aspect of the present invention;

Figure 13 is a block diagram of an alternative embodiment of the present invention in a digital system having a DSP
with a data cache;

Figures 14A and 14B together are a block diagram of the data cache of Figure 13; and

Figure 15 illustrates an exemplary implementation of a digital system that includes an embodiment of the present
invention in a mobile telecommunications device.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0016] Corresponding numerals and symbols in the different figures and tables refer to corresponding parts unless
otherwise indicated.

[0017] Figure 1 is a block diagram of a microprocessor 1 that has an embodiment of the present invention.
Microprocessor 1 is a RISC VLIW digital signal processor ("DSP"). In the interest of clarity, Figure 1 only shows those
portions of microprocessor 1 that are relevant to an understanding of an embodiment of the present invention. Details of
general construction for DSPs are well known, and may be found readily elsewhere. For example, U.S. Patent 5,072,418
issued to Frederick Boutaud, et al, describes a DSP in detail. U.S. Patent 5,329,471 issued to Gary Swoboda, et al,
describes in detail how to test and emulate a DSP. Details of portions of microprocessor 1 relevant to an embodiment of
the present invention are explained in sufficient detail hereinbelow, so as to enable one of ordinary skill in the
microprocessor art to make and use the invention.

[0018] In microprocessor 1 there are shown a central processing unit (CPU) 10, data memory 22, program memory/
cache 23, peripherals 60 and an external memory interface (EMIF) with a direct memory access (DMA) 61. CPU 10
further has an instruction fetch/decode unit 10a-c, a plurality of execution units, including an arithmetic and load/store
unit D1, a multiplier M1, an ALU/shifter unit S1, an arithmetic logic unit ("ALU") L1, and a shared multi-port register
file 20a from which data are read and to which data are written. Instructions are fetched by fetch unit 10a from
instruction memory 23 over a set of busses 41. Decoded instructions are provided from the instruction fetch/decode unit
10a-c to the functional units D1, M1, S1, and L1 over various sets of control lines which are not shown. Data are provided
to/from the register file 20a from/to load/store unit D1 over a first set of busses 32a, to multiplier M1 over a second
set of busses 34a, to ALU/shifter unit S1 over a third set of busses 36a and to ALU L1 over a fourth set of busses 38a.
Data are provided to/from the memory 22 from/to the load/store unit D1 via a fifth set of busses 40a. Note that the
entire data path described above is duplicated with register file 20b and execution units D2, M2, S2, and L2. In this
embodiment of the present invention, two unrelated aligned double word (64 bits) load/store transfers can be made in
parallel between CPU 10 and data memory 22 on each clock cycle using bus set 40a and bus set 40b.
[0019] An aligned transfer is one in which the address of a datum modulo its size (in addressable units) is 0. For
example, in a system where each 8-bit datum has an address, a 16-bit datum (2 addressable units) at address
0x24002576 is aligned because 0x24002576 mod 2 is 0. Conversely, a non-aligned transfer is one in which the address
of a datum modulo its size (in addressable units) is non-zero. For example, in a system where each 8-bit datum has an
address, a 32-bit datum (4 addressable units) at address 0x24F78006 is non-aligned because 0x24F78006 mod 4 is
2, not zero.
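
The alignment rule of paragraph [0019] can be sketched in a few lines of Python. This is only an illustration of the modulo test on a byte-addressable memory; the function name `is_aligned` is an assumption introduced here and is not part of the described hardware.

```python
def is_aligned(address: int, size_bytes: int) -> bool:
    """Return True when the address modulo the datum size (in bytes) is zero."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return address % size_bytes == 0


# The two examples given in the text, assuming byte-addressable memory.
print(is_aligned(0x24002576, 2))  # 16-bit datum (2 addressable units)
print(is_aligned(0x24F78006, 4))  # 32-bit datum (4 addressable units), remainder 2
```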

[0020] A single non-aligned double word load/store transfer is performed by scheduling a first .D unit resource and
two load/store ports on memory 22. Advantageously, an extraction circuit is connected to the memory subsystem to
provide a non-aligned data item extracted from two aligned data items requested by the .D unit. Advantageously, a
second .D unit can perform 32-bit logical or arithmetic instructions in addition to the S and L units while the address
port of the second .D unit is being used to transmit one of two contiguous addresses provided by the first .D unit.
Furthermore, a non-aligned access near the end of a circular buffer region in the target memory provides a non-aligned
data item that wraps around to the other end of the circular buffer.
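
As a rough software model of the extraction described in paragraph [0020], the sketch below reads the two contiguous aligned double words that straddle a non-aligned address and selects eight bytes from them. The function name, the use of a Python `bytes` object as memory, and little-endian byte order are all assumptions made for illustration; circular-buffer wrap-around is not modelled.

```python
def load_nonaligned_dword(memory: bytes, address: int) -> int:
    """Extract 8 bytes at any byte address from two adjacent aligned double words."""
    if address < 0 or address + 8 > len(memory):
        raise IndexError("access falls outside the modelled memory")
    first = address & ~0x7   # aligned address presented on the first memory port
    second = first + 8       # next contiguous aligned address, second memory port
    offset = address & 0x7   # position of the datum inside the first double word
    # Concatenate both aligned items, then select the 8 bytes that were requested.
    pair = memory[first:first + 8] + memory[second:second + 8]
    return int.from_bytes(pair[offset:offset + 8], "little")  # byte order assumed


memory = bytes(range(32))
print(hex(load_nonaligned_dword(memory, 3)))
```

When the address happens to be aligned, the offset is zero and all eight bytes come from the first double word, so the same path also serves aligned accesses in this model.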

[0021] Emulation circuitry 50 provides access to the internal operation of integrated circuit 1 that can be controlled
by an external test/development system (XDS) 51. External test system 51 is representative of a variety of known test
systems for debugging and emulating integrated circuits. One such system is described in U.S. Patent 5,535,331. Test
circuitry 52 contains control registers and parallel signature analysis circuitry for testing integrated circuit 1.
[0022] Note that the memory 22 and memory 23 are shown in Figure 1 to be a part of a microprocessor 1 integrated
circuit, the extent of which is represented by the box 42. The memories 22-23 could just as well be external to the
microprocessor 1 integrated circuit 42, or part of it could reside on the integrated circuit 42 and part of it be external
to the integrated circuit 42. These are matters of design choice. Also, the particular selection and number of execution
units are a matter of design choice, and are not critical to the invention.

[0023] When microprocessor 1 is incorporated in a data processing system, additional memory or peripherals may
be connected to microprocessor 1, as illustrated in Figure 1. For example, Random Access Memory (RAM) 70, a Read
Only Memory (ROM) 71 and a Disk 72 are shown connected via an external bus 73. Bus 73 is connected to the External
Memory Interface (EMIF) which is part of functional block 61 within microprocessor 1. A Direct Memory Access (DMA)
controller is also included within block 61. The DMA controller is generally used to move data between memory and
peripherals within microprocessor 1 and memory and peripherals which are external to microprocessor 1.
[0024] In the present embodiment, CPU core 10 is encapsulated as a MegaModule; however, other embodiments
of the present invention may be in custom designed CPUs or mass market microprocessors, for example.
[0025] A detailed description of various architectural features of the microprocessor 1 of Figure 1 is provided in
coassigned European Patent Application No. 98101291.7 filed 26th January 2000. A description of enhanced
architectural features and an extended instruction set not described herein for CPU 10 is provided in coassigned European



EP1126368 A2 



Patent Application No. 00310098.9 filed 14th November 2000.

[0026] Figure 2 is a block diagram of the execution units and register files of the microprocessor of Figure 1 and
shows a more detailed view of the buses connecting the various functional blocks. In this figure, all data busses are
32 bits wide, unless otherwise noted. There are two general-purpose register files (A and B) in the processor's data
paths. Each of these files contains 32 32-bit registers (A0-A31 for file A and B0-B31 for file B). The general-purpose
registers can be used for data, data address pointers, or condition registers. Any number of reads of a given register
can be performed in a given cycle.

[0027] The general-purpose register files support data ranging in size from packed 8-bit data through 64-bit
fixed-point data. Values larger than 32 bits, such as 40-bit long and 64-bit double word quantities, are stored in register
pairs, with the 32 LSBs of data placed in an even-numbered register and the remaining 8 or 32 MSBs in the next upper
register (which is always an odd-numbered register). Packed data types store either four 8-bit values or two 16-bit
values in a single 32-bit register.

[0028] There are 32 valid register pairs for 40-bit and 64-bit data, as shown in Table 1. In assembly language syntax,
a colon between the register names denotes the register pairs and the odd-numbered register is specified first.

[0029] Operations requiring a long input ignore the 24 MSBs of the odd register. Operations producing a long result
zero-fill the 24 MSBs of the odd register. The even register is encoded in the opcode.
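
The register-pair conventions of paragraphs [0027] to [0029] can be illustrated with plain integer arithmetic. The helper names below are hypothetical and serve only to show which bits land in the even and odd registers; they do not correspond to any instruction of the DSP.

```python
MASK32 = 0xFFFFFFFF


def split_dword(value: int) -> tuple[int, int]:
    """Split a 64-bit value into (odd_register, even_register) contents."""
    return (value >> 32) & MASK32, value & MASK32


def long_result_to_pair(value40: int) -> tuple[int, int]:
    """Store a 40-bit long result: the 24 MSBs of the odd register are zero-filled."""
    value40 &= (1 << 40) - 1
    return (value40 >> 32) & 0xFF, value40 & MASK32


def long_input_from_pair(odd: int, even: int) -> int:
    """Read a 40-bit long input: the 24 MSBs of the odd register are ignored."""
    return ((odd & 0xFF) << 32) | (even & MASK32)


odd, even = split_dword(0x1122334455667788)
print(hex(odd), hex(even))
```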

[0030] All eight of the functional units have access to the opposite side's register file via a cross path. The .M1, .M2,
.S1, .S2, .D1 and .D2 units' src2 inputs are selectable between the cross path and the same side register file by
appropriate selection of multiplexers 213, 214 and 215, for example. In the case of the .L1 and .L2 units, both src1 and
src2 inputs are also selectable between the cross path and the same-side register file by appropriate selection of
multiplexers 211, 212, for example.



Table 1. 40-Bit/64-Bit Register Pairs

        Register Files
    A               B
    A1:A0           B1:B0
    A3:A2           B3:B2
    A5:A4           B5:B4
    A7:A6           B7:B6
    A9:A8           B9:B8
    A11:A10         B11:B10
    A13:A12         B13:B12
    A15:A14         B15:B14
    A17:A16         B17:B16
    A19:A18         B19:B18
    A21:A20         B21:B20
    A23:A22         B23:B22
    A25:A24         B25:B24
    A27:A26         B27:B26
    A29:A28         B29:B28
    A31:A30         B31:B30



[0031] Referring again to Figure 2, the eight functional units in processor 10's data paths can be divided into two
groups of four; each functional unit in one data path is almost identical to the corresponding unit in the other data path.
The functional units are described in Table 2.






Table 2. Functional Units and Operations Performed

    Functional Unit               Fixed-Point Operations

    .L unit (.L1, .L2), 18a,b     32/40-bit arithmetic and compare operations
                                  32-bit logical operations
                                  Leftmost 1 or 0 counting for 32 bits
                                  Normalization count for 32 and 40 bits
                                  Byte shifts
                                  Data packing/unpacking
                                  5-bit constant generation
                                  Paired 16-bit arithmetic operations
                                  Quad 8-bit arithmetic operations
                                  Paired 16-bit min/max operations
                                  Quad 8-bit min/max operations

    .S unit (.S1, .S2), 16a,b     32-bit arithmetic operations
                                  32/40-bit shifts and 32-bit bit-field operations
                                  32-bit logical operations
                                  Branches
                                  Constant generation
                                  Register transfers to/from control register file (.S2 only)
                                  Byte shifts
                                  Data packing/unpacking
                                  Paired 16-bit compare operations
                                  Quad 8-bit compare operations
                                  Paired 16-bit shift operations
                                  Paired 16-bit saturated arithmetic operations
                                  Quad 8-bit saturated arithmetic operations

    .M unit (.M1, .M2), 14a,b     16 x 16 multiply operations
                                  16 x 32 multiply operations
                                  Bit expansion
                                  Bit interleaving/de-interleaving
                                  Quad 8 x 8 multiply operations
                                  Paired 16 x 16 multiply operations
                                  Paired 16 x 16 multiply with add/subtract operations
                                  Quad 8 x 8 multiply with add operations
                                  Variable shift operations
                                  Rotation
                                  Galois Field Multiply

    .D unit (.D1, .D2), 12a,b     32-bit add, subtract, linear and circular address calculation
                                  Loads and stores with 5-bit constant offset
                                  Loads and stores with 15-bit constant offset (.D2 only)
                                  Load and store double words with 5-bit constant
                                  Load and store non-aligned words and double words
                                  5-bit constant generation
                                  32-bit logical operations

[0032] Most data lines in the CPU support 32-bit operands, and some support long (40-bit) and double word (64-bit)
operands. Each functional unit has its own 32-bit write port into a general-purpose register file 20a, 20b (refer to Figure
2). All units ending in 1 (for example, .L1) write to register file A 20a and all units ending in 2 write to register file B 20b.
Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2)
have an extra 8-bit-wide port (long-dst) for 40-bit long writes, as well as an 8-bit input (long-src) for 40-bit long reads.
Because each unit has its own 32-bit write port dst, when performing 32-bit operations all eight units can be used in
parallel every cycle. Since each multiplier can return up to a 64-bit result, two write ports (dst1 and dst2) are provided
from the multipliers to the register file.



Memory, Load and Store Paths

[0033] Processor 10 supports double word loads and stores. There are four 32-bit paths for loading data from memory
to the register file. For side A, LD1a is the load path for the 32 LSBs (least significant bits); LD1b is the load path for
the 32 MSBs (most significant bits). For side B, LD2a is the load path for the 32 LSBs; LD2b is the load path for the
32 MSBs. There are also four 32-bit paths for storing register values to memory from each register file. ST1a is the
write path for the 32 LSBs on side A; ST1b is the write path for the 32 MSBs for side A. For side B, ST2a is the write
path for the 32 LSBs; ST2b is the write path for the 32 MSBs.

[0034] The ports for long and double word operands are shared between the S and L functional units. This places 
a constraint on which long or double word operations can be scheduled on a datapath in the same execute packet. 

Data Address Paths 

[0035] Bus 40a has an address bus DA1 which is driven by mux 200a. This allows an address generated by either
load/store unit D1 or D2 to provide a memory address for loads or stores for register file 20a. Data bus LD1a,b loads
data from an address in memory 22 specified by address bus DA1 to a register in register file 20a. Likewise, data bus
ST1a,b stores data from register file 20a to memory 22. Load/store unit D1 performs the following operations: 32-bit
add, subtract, linear and circular address calculations. Load/store unit D2 operates similarly to unit D1, with the
assistance of mux 200b for selecting an address.

[0036] The DA1 and DA2 address resources and their associated data paths are connected to target memory ports
on memory 22 specified as T1 and T2 respectively. T1 connects to the DA1 address path and the LD1a, LD1b, ST1a
and ST1b data paths. Similarly, T2 connects to the DA2 address path and the LD2a, LD2b, ST2a and ST2b data paths.
The T1 and T2 designations appear in functional unit fields for load and store instructions.
[0037] For example, the following load instruction uses the .D1 unit to generate the address but uses the LD2a
path resource with the DA2 address bus connected to target port T2 to place the data in the B register file:

LDW .D1T2 *A0[3], B1

The use of the DA2 address resource is indicated with the T2 designation.

Instruction Syntax

[0038] An instruction syntax is used to describe each instruction. An opcode map breaks down the various bit fields
that make up each instruction. There are certain instructions that can be executed on more than one functional unit.
The syntax specifies the functional unit and various resources used by an instruction, and typically has a form as
follows: operation, .unit, src, dst. src and dst indicate source and destination operands respectively. The .unit dictates
which functional unit the instruction is mapped to (.L1, .L2, .S1, .S2, .M1, .M2, .D1, or .D2). Several instructions have
three opcode operand fields: src1, src2, and dst.

[0039] Figure 3A illustrates an opcode map for the load/store instructions which are executed in a .D unit of the DSP.
Figure 3B illustrates an opcode map for non-aligned double word load/store instructions, and Figure 3C illustrates an
opcode map for Boolean instructions executed by the .D units. Table 3 lists the opcodes for various load/store (LD/ST)
instructions performed by the CPU of the present embodiment. Opcode field 510 and R-field 512 define the operation
of the LD/ST instructions. An aspect of the present invention is that processor 10 performs non-aligned load and store
instructions by using resources of one D unit and both target ports T1 and T2, as will be described in more detail below.
Advantageously, the second D unit is available to execute a Boolean or arithmetic instruction in parallel with the
execution of a non-aligned load/store instruction.

[0040] The dst field of the LD/STNDW instruction selects a register pair, a consecutive even-numbered and
odd-numbered register pair from the same register file. The instruction can be used to load a pair of 32-bit integers. The
least significant 32 bits are loaded into the even-numbered register and the most significant 32 bits are loaded into the
next register (which is always an odd-numbered register).



Table 3. Load/Store Instruction Opcodes

    R-Opcode extension   LD/ST Op   Instruction   Size                 Alignment
    0                    000        LDHU          Half word unsigned   Half word
    0                    001        LDBU          Byte unsigned        Byte
    0                    010        LDB           Byte                 Byte
    0                    011        STB           Byte                 Byte
    0                    100        LDH           Half word            Half word
    0                    101        STH           Half word            Half word
    0                    110        LDW           Word                 Word
    0                    111        STW           Word                 Word
    1                    010        LDNDW         Double word          Byte (non-aligned)
    1                    011        LDNW          Word                 Byte (non-aligned)
    1                    100        STDW          Double word          Double word
    1                    101        STNW          Word                 Byte (non-aligned)
    1                    110        LDDW          Double word          Double word
    1                    111        STNDW         Double word          Byte (non-aligned)



Addressing Modes 

[0041] The addressing modes are linear, circular using block size field BKO. and circular using block size field BK1 . 
Eight registers can perform circular addressing. A4-A7 are used by the .D1 unit and B4-B7 are used by the .02 unit. 
No other units can perform circular addressing modes. For each of these registers, an addressing mode register (AMR) 
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contained in control register file 1 02 specifies the addressing mode. The block size fields are also in the AMR. 
[0042] Referring again to Figures 3A and 3B, linear mode addressing simply shifts the offsetR/cst operand 516 to the
left by 3, 2, 1, or 0 for double word, word, half-word, or byte access respectively and then performs an add or subtract
to baseR 514, depending on the address mode specified. For the pre-increment, pre-decrement, positive offset, and
negative offset address generation options, the result of the calculation is the address to be accessed in memory. For
post-increment or post-decrement addressing, the value of baseR before the addition or subtraction is the address to
be accessed from memory. Address modes are specified by mode field 500 and listed in Table 4. The increment/
decrement mode controls whether the updated address is written back to the register file. Otherwise, it is rather similar
to offset mode. The pre-increment and offset modes differ only in whether the result is written back to "base". The
post-increment mode is similar to pre-increment (i.e. the new address is written to "base"), but differs in that the old value
of "base" is used as the address for the access. The same applies for negative offset vs. decrement mode.



Table 4. Address Generator Options

    Mode Field   Syntax           Modification Performed
    0101         *+R[offsetR]     Positive offset; addr = base + offset * scale
    0100         *-R[offsetR]     Negative offset; addr = base - offset * scale
    1101         *++R[offsetR]    Preincrement; addr = base + offset * scale; base = addr
    1100         *--R[offsetR]    Predecrement; addr = base - offset * scale; base = addr
    1111         *R++[offsetR]    Postincrement; addr = base; base = base + offset * scale
    1110         *R--[offsetR]    Postdecrement; addr = base; base = base - offset * scale
    0001         *+R[ucst5]       Positive offset; addr = base + offset * scale
    0000         *-R[ucst5]       Negative offset; addr = base - offset * scale
    1001         *++R[ucst5]      Preincrement; addr = base + offset * scale; base = addr
    1000         *--R[ucst5]      Predecrement; addr = base - offset * scale; base = addr
    1011         *R++[ucst5]      Postincrement; addr = base; base = base + offset * scale
    1010         *R--[ucst5]      Postdecrement; addr = base; base = base - offset * scale
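
The linear address generation options listed in Table 4 can be modelled as a small Python function. This is a sketch under stated assumptions: the mode names ("pos", "preinc", and so on), the 32-bit wrap-around mask and the function name are illustrative choices made here, not identifiers from the opcode map.

```python
MASK32 = 0xFFFFFFFF
SHIFT_FOR_SIZE = {1: 0, 2: 1, 4: 2, 8: 3}  # byte, half word, word, double word


def linear_address(base: int, offset: int, access_bytes: int, mode: str) -> tuple[int, int]:
    """Return (address accessed, value of base after the access) for one mode."""
    delta = offset << SHIFT_FOR_SIZE[access_bytes]  # offset * scale
    if mode in ("pos", "preinc", "postinc"):
        updated = (base + delta) & MASK32
    elif mode in ("neg", "predec", "postdec"):
        updated = (base - delta) & MASK32
    else:
        raise ValueError(f"unknown mode: {mode}")
    if mode in ("pos", "neg"):
        return updated, base      # offset modes: base is not written back
    if mode in ("preinc", "predec"):
        return updated, updated   # pre modes: access the new address, write it back
    return base, updated          # post modes: access the old base, write back the new one


# Word access with offset 3, as in the form *A0[3]: the address is base + 3 * 4.
addr, new_base = linear_address(0x1000, 3, 4, "pos")
print(hex(addr), hex(new_base))
```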



[0043] Figure 4 illustrates the addressing mode register (AMR), which is included in control register file 102 and is
accessible via a "move between control file and the register file" (MVC) instruction. Eight registers (A4-A7, B4-B7) can
perform circular addressing. For each of these registers, the AMR specifies the addressing mode. A 2-bit field for each
register is used to select the address modification mode: linear (the default) or circular mode. With circular addressing,
the field also specifies which BK (block size) field to use for a circular buffer. In this embodiment, the buffer must be
aligned on a byte boundary equal to the block size. The mode select field encoding is shown in Table 5.



Table 5. Addressing Mode Field Encoding

Mode  Description
----  ---------------------------------------
00    Linear modification (default at reset)
01    Circular addressing using the BK0 field
10    Circular addressing using the BK1 field
11    Reserved



[0044] The block size fields, BK0 and BK1, specify block sizes for circular addressing. The five bits in BK0 and BK1
specify the width. The formula for calculating the block size width is:

Block Size (in bytes) = 2^(N+1)

where N is the value in BK1 or BK0.
[0045] Table 6 shows block size calculations for all 32 possibilities.



Table 6. Block Size Calculations

N      Block Size  N      Block Size
-----  ----------  -----  -------------
00000  2           10000  131,072
00001  4           10001  262,144
00010  8           10010  524,288
00011  16          10011  1,048,576
00100  32          10100  2,097,152
00101  64          10101  4,194,304
00110  128         10110  8,388,608
00111  256         10111  16,777,216
01000  512         11000  33,554,432
01001  1,024       11001  67,108,864
01010  2,048       11010  134,217,728
01011  4,096       11011  268,435,456
01100  8,192       11100  536,870,912
01101  16,384      11101  1,073,741,824
01110  32,768      11110  2,147,483,648
01111  65,536      11111  4,294,967,296

Note: when N is 11111, the behavior is identical to linear addressing.
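As a brief illustration, the block size formula can be evaluated for any value of a BK field as sketched below. The function name block_size is hypothetical and is used here only to reproduce the entries of Table 6.

```python
def block_size(n):
    """Block size in bytes for a 5-bit BK field value n: 2^(n+1)."""
    if not 0 <= n <= 31:
        raise ValueError("BK fields are five bits wide")
    return 1 << (n + 1)


# Reproduce a few rows of Table 6.
for n in (0b00000, 0b00100, 0b01111, 0b11111):
    print(f"{n:05b} {block_size(n):,}")
```

With N = 11111 the block spans the full 32-bit address space, which is why the behavior is then identical to linear addressing.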



[0046] Circular mode addressing uses the BK0 and BK1 fields in the AMR to specify block sizes for circular addressing.
Circular mode addressing operates as follows with LD/ST instructions: offsetR/cst is shifted to the left by 3, 2,
1, or 0 for LDDW, LDW, LDH, or LDB respectively, and is then added to or subtracted from baseR to produce the final
address. This add or subtract is performed by only allowing bits N through 0 of the result to be updated, leaving bits
31 through N+1 unchanged after address arithmetic. The resulting address is bounded to a range of 2^(N+1) bytes,
regardless of the size of the offsetR/cst.

[0047] The circular buffer size in the AMR is not scaled; for example, a size of 8 is 8 bytes, not 8 x sizeof(type).
So, to perform circular addressing on an array of 8 words, a size of 32 should be specified, or N = 4. Table 7 shows
an example LDW instruction performed with register A4 in circular mode, with BK0 = 4, so the buffer size is 32 bytes,
16 halfwords, or 8 words. The value put in the AMR for this example is 00040001h. In this example, an offset of "9" is
specified. 9h (hexadecimal) words is 24h bytes. 24h bytes is 4 bytes beyond the 32-byte (20h) boundary 100h-11Fh;
thus, it is wrapped around to 104h (124h - 20h = 104h).
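The circular address arithmetic described above can be sketched as follows. This is an illustrative model only; circular_add and build_amr are hypothetical names, and the AMR field positions used in build_amr (mode field for A4 in bits 1-0, BK0 starting at bit 16) are an assumption inferred from the example value 00040001h rather than taken from Figure 4.

```python
def circular_add(base, byte_offset, n):
    """Add byte_offset to base, updating only bits n..0 and leaving bits 31..n+1 unchanged."""
    mask = (1 << (n + 1)) - 1
    upper = base & ~mask & 0xFFFFFFFF      # bits 31..n+1 are preserved
    lower = (base + byte_offset) & mask    # bits n..0 wrap within the block
    return upper | lower


def build_amr(a4_mode, bk0):
    """Assumed layout: A4 mode field in bits 1-0, BK0 field starting at bit 16."""
    return (bk0 << 16) | a4_mode


# LDW .D1 *++A4[9],A1 with A4 = 100h and BK0 = 4 (32-byte buffer).
offset_bytes = 9 << 2                      # LDW shifts the offset left by 2
print(hex(circular_add(0x100, offset_bytes, 4)))   # 0x104
print(f"{build_amr(0b01, 4):08X}h")                # 00040001h
```

A negative byte_offset models the subtract case, since only the low-order bits of the sum are retained.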



Table 7. LDW in Circular Mode

LDW .D1 *++A4[9],A1

          Before LDW    1 cycle after LDW   5 cycles after LDW
--------  ------------  ------------------  ------------------
A4        0000 0100h    0000 0104h          0000 0104h
A1        XXXX XXXXh    XXXX XXXXh          1234 5678h
mem 104h  1234 5678h    1234 5678h          1234 5678h



Non-Aligned Memory Access Considerations

[0048] Circular addressing may be used with non-aligned accesses. When circular addressing is enabled, address
updates and memory accesses occur in the same manner as for the equivalent sequence of byte accesses. The only
restriction is that the circular buffer size be at least as large as the data size being accessed. Non-aligned access to
circular buffers that are smaller than the data being read produces undefined results.

[0049] Non-aligned accesses to a circular buffer apply the circular addressing calculation to logically adjacent memory
addresses. The result is that non-aligned accesses near the boundary of a circular buffer will correctly read data
from both ends of the circular buffer, thus seamlessly causing the circular buffer to "wrap around" at the edges.
[0050] Figures 5A, 5B and 5C illustrate aspects of non-aligned address formation and non-aligned data extraction
from a circular buffer region, according to an aspect of the present invention. Consider, for example, a circular buffer
500 that has a size of 16 bytes, illustrated in Figure 5A. A circular buffer of this size is specified by setting either BK0
or BK1 to "00011". For example, with register A4 in circular mode and BK0 = 3, the buffer size is 16 bytes, 8 half words,
or 4 words. The value put in the AMR for this example is 00030001h. The buffer starts at address 0x0020 (502) and
ends at 0x002F (504). The register A4 is initialized to the address 0x0028, for example; however, the buffer could be
located at other places in the memory by setting more significant address bits in register A4. Below the buffer at address
0x1F (506) and above the buffer at address 0x30 (508), data can be stored that is not relevant to the buffer.
[0051] The effect of circular buffering is that memory accesses and address updates in the 0x20 - 0x2F
range stay completely inside this range. Effectively, the memory map behaves as illustrated in Figure 5B. Executing an
LDW instruction with an offset of 1 in post increment mode will provide an address of 0x0028 (511) and access word
510, for example. Executing the instruction a second time will provide an address of 0x002C (513) and access word
512 at the end of the circular buffer. Executing the instruction a third time will provide an address of 0x0020 (502a) and
access word 514. Note that word 514 actually corresponds to the other end of the circular buffer, but was accessed
by incrementing the address provided by the LDW instruction.
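The three post-increment accesses described above can be traced with a short sketch. It is illustrative only; the helper names circular_add and ldw_postincrement are hypothetical, and the 16-byte buffer at 0x20 with BK0 = 3 is taken from the example.

```python
def circular_add(base, byte_offset, n):
    """Add byte_offset to base, updating only bits n..0."""
    mask = (1 << (n + 1)) - 1
    return (base & ~mask & 0xFFFFFFFF) | ((base + byte_offset) & mask)


def ldw_postincrement(a4, offset_words, n):
    """Model *A4++[offset]: the old A4 is the access address, A4 is then updated circularly."""
    addr = a4
    new_a4 = circular_add(a4, offset_words << 2, n)
    return addr, new_a4


a4 = 0x28
addresses = []
for _ in range(3):
    addr, a4 = ldw_postincrement(a4, 1, 3)
    addresses.append(addr)

print([hex(a) for a in addresses])   # ['0x28', '0x2c', '0x20']
```

The third access lands at 0x20 rather than 0x30 because only address bits 3 through 0 are updated.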

[0052] Figure 5C illustrates the operation of an access into the circular buffer using a non-aligned load/store
instruction. In this example, A4 is initialized to the address 0x002A and a non-aligned double word load instruction (LDNDW)
with a non-scaled offset of "1" in post increment mode is executed. As discussed above, two addresses will be sent to the
two ports on the memory. An address of 0x002A (534) will be sent on the first port, which results in accessing an aligned
double word DW1 from memory, aligned on address 0x0028 (530). A second address will be sent to the memory system that
is incremented by the line size of the instruction. Since in this example the instruction is a double word instruction, the
line size is two words, or eight bytes. Thus the second address is incremented by eight bytes to be 0x0032. However,
according to an aspect of the present invention, this address is bounded to 0x0022 (536) by circular addressing circuitry
to remain within the bounds of circular buffer region 500. The memory system then accesses a second aligned double
word DW2, aligned at address 0x0020 (532). Extraction circuitry then extracts non-aligned double word NADW1 from
the two logically adjacent double words DW1 and DW2, even though they are actually physically from different ends
of the circular buffer.
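A behavioral sketch of this non-aligned double word load is given below. It is an illustration under stated assumptions, not the extraction circuitry itself: memory is modeled as a byte array in which each byte holds its own address, the function names are hypothetical, and byte ordering within the register is ignored.

```python
def circular_add(base, byte_offset, n):
    """Add byte_offset to base, updating only bits n..0."""
    mask = (1 << (n + 1)) - 1
    return (base & ~mask & 0xFFFFFFFF) | ((base + byte_offset) & mask)


def ldndw_circular(mem, addr, n):
    """Return (first address, second address, eight extracted bytes)."""
    first = addr
    second = circular_add(addr, 8, n)      # line size of a double word, bounded to the buffer
    dw1 = first & ~7                       # aligned double word holding the first byte
    dw2 = second & ~7                      # logically adjacent aligned double word
    lanes = mem[dw1:dw1 + 8] + mem[dw2:dw2 + 8]
    shift = first & 7                      # byte position of the item within DW1
    return first, second, bytes(lanes[shift:shift + 8])


mem = bytearray(range(0x40))               # each byte holds its own address
first, second, nadw1 = ldndw_circular(mem, 0x2A, 3)
print(hex(first), hex(second), nadw1.hex())
```

The extracted bytes come from addresses 0x2A through 0x2F followed by 0x20 and 0x21, which matches the equivalent sequence of eight circular byte accesses.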

[0053] Still referring to Figure 5C, executing the LDNDW instruction again results in sending to the memory system a
first address of 0x002B (538), incremented by the non-scaled offset of "1", and a second address of 0x0023 (540),
incremented by the line size and bounded to remain within the circular buffer region. The memory system will access the
same two aligned double words DW1, DW2. However, the extraction circuitry now extracts non-aligned double word
NADW2 in response to incremented address 538.

[0054] As another example, Table 8 shows an LDNW performed with register A4 in circular mode and BK0 = 3, so
the buffer size is 16 bytes, 8 half words, or 4 words. The value put in the AMR for this example is 00030001h. The
buffer starts at address 0x0020 and ends at 0x002F. The register A4 is initialized to the address 0x002A. In this example,
an offset of "2" is specified. 2h words is 8h bytes. 8h bytes starting at address 002Ah reaches 0032h, which is 3 bytes
beyond the end of the 16-byte (10h) buffer at 002Fh; thus, it is wrapped around to 0022h (0032h - 10h = 0022h). In this
example, the two addresses sent to the memory subsystem are contiguous; the first address is 0x0022 and the second
address is incremented by a line size of 4h, to become 0x0026.

Table 8. LDNW in Circular Mode

LDNW .D1 *++A4[2],A1

           Before LDW    1 cycle after LDW   5 cycles after LDW
---------  ------------  ------------------  ------------------
A4         0000 002Ah    0000 0022h          0000 0022h
A1         XXXX XXXXh    XXXX XXXXh          5678 9ABCh
mem 0022h  5678 9ABCh    5678 9ABCh          5678 9ABCh





[0055] Figure 6 illustrates the basic format of a fetch packet of the DSP. In this embodiment, instructions are always
fetched eight at a time. This constitutes a fetch packet. The execution grouping of the fetch packet is specified by the
p-bit, bit zero, of each instruction. Fetch packets are 8-word aligned and can contain up to eight instructions. A p-bit in
each instruction controls the parallel execution of instructions. A set of instructions executing in parallel constitutes an
execute packet. An execute packet can contain up to eight instructions.

[0056] The p-bit controls the parallel execution of instructions. The p-bits are scanned from left to right (lower to
higher address). If the p-bit of instruction i is 1, then instruction i + 1 is to be executed in parallel with (in the same cycle
as) instruction i. If the p-bit of instruction i is 0, then instruction i + 1 is executed in the cycle after instruction i. All
instructions executing in parallel constitute an execute packet. An execute packet can contain up to eight instructions.
All instructions in an execute packet must use a unique functional unit.
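The p-bit scan described above can be sketched as follows. This is a minimal illustration; the function name execute_packets is hypothetical, and the handling of a trailing p-bit of 1 (the remaining instructions are simply closed into a final packet) is an assumption, since the text does not address that case.

```python
def execute_packets(p_bits):
    """Group instruction indices of one fetch packet into execute packets.

    p_bits[i] == 1 means instruction i + 1 executes in parallel with instruction i;
    p_bits[i] == 0 means instruction i + 1 executes in the following cycle.
    """
    packets, current = [], []
    for i, p in enumerate(p_bits):
        current.append(i)
        if p == 0:                 # instruction i closes the current execute packet
            packets.append(current)
            current = []
    if current:                    # assumption: close any packet left open at the end
        packets.append(current)
    return packets


print(execute_packets([0, 0, 0, 0, 0, 0, 0, 0]))   # fully serial: eight packets
print(execute_packets([1, 1, 1, 1, 1, 1, 1, 0]))   # fully parallel: one packet
print(execute_packets([0, 0, 1, 1, 0, 1, 1, 0]))   # partially parallel
```

Each inner list corresponds to one execute packet and therefore to one CPU cycle of dispatch.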

[0057] There are three types of p-bit patterns for fetch packets which result in the following execution sequences for
the eight instructions: fully serial, fully parallel, or partially parallel. As discussed above, in this embodiment of the
invention, a non-aligned fetch or store instruction uses only one .D unit, so that the other .D unit is available for parallel
execution of an instruction that does not access a load/store port on memory 22. The processor can access words and
double words at any byte boundary using non-aligned loads and stores. As a result, word and double word data does
not always need alignment to 32-bit or 64-bit boundaries.

[0058] As an example of parallel execution of load/store instructions, the following execute packet is invalid in the
present embodiment because there are two memory operations and one of them is non-aligned:

    LDNW .D2T2 *B2[B12],B13; || LDB .D1T1 *A2,A14;

However, the following execute packet is valid in the present embodiment because there is a non-memory
Boolean/arithmetic instruction being executed on .D1 in parallel with a non-aligned load instruction on .D2:

    LDNW .D2T2 *B2[B12],A13; || ADD .D1x A12,B13,A14;

Pipeline Operation

[0059] The instruction execution pipeline of DSP 1 has several key features which improve performance, decrease
cost, and simplify programming, including: increased pipelining eliminates traditional architectural bottlenecks in
program fetch, data access, and multiply operations; control of the pipeline is simplified by eliminating pipeline interlocks;
the pipeline can dispatch eight parallel instructions every cycle; parallel instructions proceed simultaneously through
the same pipeline phases; sequential instructions proceed with the same relative pipeline phase difference; and load
and store addresses appear on the CPU boundary during the same pipeline phase, eliminating read-after-write memory
conflicts.

[0060] A multi-stage memory pipeline is present for both data accesses in memory 22 and program fetches in memory
23. This allows use of high-speed synchronous memories both on-chip and off-chip, and allows infinitely nestable
zero-overhead looping with branches in parallel with other instructions.

[0061] There are no internal interlocks in the execution cycles of the pipeline, so a new execute packet enters
execution every CPU cycle. Therefore, the number of CPU cycles for a particular algorithm with particular input data is
fixed. If during program execution there are no memory stalls, the number of CPU cycles equals the number of clock
cycles for a program to execute.

[0062] Performance can be inhibited by stalls from the memory system, stalls for cross path dependencies, or
interrupts. The reasons for memory stalls are determined by the memory architecture. Cross path stalls are described in
detail in U.S. Patent Application No. 09/702,456, to Steiss, et al. To fully understand how to optimize a program for
speed, the sequence of program fetch, data store, and data load requests the program makes, and how they might
stall the CPU, should be understood.

[0063] The pipeline operation, from a functional point of view, is based on CPU cycles. A CPU cycle is the period 
during which a particular execute packet is in a particular pipeline stage. CPU cycle boundaries always occur at clock 
cycle boundaries; however, memory stalls can cause CPU cycles to extend over multiple clock cycles. To understand 
the machine state at CPU cycle boundaries, one must be concerned only with the execution phases (E1-E5) of the 
pipeline. The phases of the pipeline are described in Table 9. 



Table 9. Pipeline Phase Description

Pipeline        Pipeline Phase            Symbol  During This Phase                                   Instruction Types Completed
--------------  ------------------------  ------  --------------------------------------------------  ---------------------------
Program Fetch   Program Address Generate  PG      Address of the fetch packet is determined.
                Program Address Send      PS      Address of fetch packet is sent to memory.
                Program Wait              PW      Program memory access is performed.
                Program Data Receive      PR      Fetch packet is expected at CPU boundary.
Program Decode  Dispatch                  DP      Next execute packet in fetch packet determined
                                                  and sent to the appropriate functional units to
                                                  be decoded.
                Decode                    DC      Instructions are decoded at functional units.
Execute         Execute 1                 E1      For all instruction types, conditions for           Single-cycle
                                                  instructions are evaluated and operands read.
                                                  Load and store instructions: address generation
                                                  is computed and address modifications written to
                                                  register file.†
                                                  Branch instructions: affects branch fetch packet
                                                  in PG phase.†
                                                  Single-cycle instructions: results are written to
                                                  a register file.†
                Execute 2                 E2      Load instructions: address is sent to memory.†      Stores, STP, Multiplies
                                                  Store instructions and STP: address and data are
                                                  sent to memory.†
                                                  Single-cycle instructions that saturate results
                                                  set the SAT bit in the Control Status Register
                                                  (CSR) if saturation occurs.†
                                                  Multiply instructions: results are written to a
                                                  register file.†
                Execute 3                 E3      Data memory accesses are performed. Any multiply
                                                  instruction that saturates results sets the SAT
                                                  bit in the Control Status Register (CSR) if
                                                  saturation occurs.†
                Execute 4                 E4      Load instructions: data is brought to CPU
                                                  boundary.†
                Execute 5                 E5      Load instructions: data is loaded into register.†   Loads

† This assumes that the conditions for the instructions are evaluated as true. If the condition is evaluated as false,
the instruction will not write any results or have any pipeline operation after E1.



[0064] The pipeline operation of the instructions can be categorized into seven types shown in Table 10. The delay
slots for each instruction type are listed in the second column.

Table 10. Delay Slot Summary

Instruction Type                               Delay Slots  Execute Stages Used
---------------------------------------------  -----------  -------------------
Branch (The cycle when the target enters E1)   5            E1-branch target E1
Load (LD) (Incoming Data)                      4            E1-E5
Load (LD) (Address Modification)               0            E1
Multiply                                       1            E1-E2
Single-cycle                                   0            E1
Store                                          0            E1
NOP (no execution pipeline operation)
STP (no CPU internal results written)


[0065] The execution of instructions can be defined in terms of delay slots (Table 10). A delay slot is a CPU cycle
that occurs after the first execution phase (E1) of an instruction in which results from the instruction are not available.
For example, a multiply instruction has 1 delay slot; this means that there is 1 CPU cycle before another instruction
can use the results from the multiply instruction.

[0066] Single cycle instructions execute during the E1 phase of the pipeline. The operand is read, the operation is
performed and the results are written to a register, all during E1. These instructions have no delay slots.
[0067] Load instructions have two results: data loaded from memory and address pointer modification.
[0068] Data loads complete their operations during the E5 phase of the pipeline. In the E1 phase, the address of the
data is computed. In the E2 phase, the data address is sent to data memory. In the E3 phase, a memory read is
performed. In the E4 stage, the data is received at the CPU core boundary. Finally, in the E5 phase, the data is loaded
into a register. Because data is not written to the register until E5, these instructions have 4 delay slots. Because pointer
results are written to the register in E1, there are no delay slots associated with the address modification.
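The relationship between delay slots and the cycle in which a result becomes usable can be sketched as follows. The table values are those of Table 10; the helper name first_use_cycle is hypothetical, and the sketch counts CPU cycles and assumes that no memory stalls occur.

```python
# Delay slots per instruction type, taken from Table 10.
DELAY_SLOTS = {
    "single-cycle": 0,
    "multiply": 1,
    "load": 4,        # incoming data; the address modification has 0 delay slots
}


def first_use_cycle(e1_cycle, kind):
    """Earliest CPU cycle in which a following instruction can read the result.

    e1_cycle is the cycle in which the producing instruction is in its E1 phase.
    """
    return e1_cycle + DELAY_SLOTS[kind] + 1


for kind in ("single-cycle", "multiply", "load"):
    print(kind, first_use_cycle(10, kind))
```

For a load whose E1 phase is cycle 10, the data is written in E5 (cycle 14), so the first instruction that can consume it has its E1 phase in cycle 15.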
[0069] Store instructions complete their operations during the E3 phase of the pipeline. In the E1 phase, the address
of the data is computed. In the E2 phase, the data address is sent to data memory. In the E3 phase, a memory write
is performed. The address modification is performed in the E1 stage of the pipeline. Even though stores finish their
execution in the E3 phase of the pipeline, they have no delay slots and follow the following rules (i = cycle):

1) When a load is executed before a store, the old value is loaded and the new value is stored.

2) When a store is executed before a load, the new value is stored and the new value is loaded.

3) When the instructions are in parallel, the old value is loaded and the new value is stored.
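The three ordering rules can be checked against a simple behavioral model, sketched below. This is an illustration of the visible ordering only, not of the pipeline phases; the schedule format and the function name run are hypothetical.

```python
def run(schedule, memory):
    """Apply a schedule of execute packets to memory and return the loaded values.

    schedule is a list of cycles; each cycle is a list of operations of the form
    ("load", addr) or ("store", addr, value). Within one cycle, loads observe
    memory before the stores of that cycle take effect.
    """
    loaded = []
    for cycle in schedule:
        for op in cycle:
            if op[0] == "load":
                loaded.append(memory[op[1]])
        for op in cycle:
            if op[0] == "store":
                memory[op[1]] = op[2]
    return loaded


OLD, NEW = 0x1111, 0x2222
print(run([[("load", 0x100)], [("store", 0x100, NEW)]], {0x100: OLD}))   # rule 1
print(run([[("store", 0x100, NEW)], [("load", 0x100)]], {0x100: OLD}))   # rule 2
print(run([[("load", 0x100), ("store", 0x100, NEW)]], {0x100: OLD}))     # rule 3
```

In all three cases the memory location holds the new value afterwards; only the loaded value differs.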

[0070] As discussed earlier, in this embodiment of the present invention non-aligned load and store instructions are
performed by using resources of one D unit and both target ports T1 and T2, as will be described in more detail below.
Advantageously, the second D unit is available to execute a Boolean or arithmetic instruction in parallel with the
execution of a non-aligned load/store instruction. Aspects of non-aligned memory accesses will now be described in more
detail.

[0071] Figure 7 is a memory map of a portion of the memory space of the DSP 1 and illustrates various aligned and
non-aligned memory accesses. This portion of memory can be at any address YYYYYNNXh, but only the portion of
the address represented by NNXh will be referred to herein, for convenience. Furthermore, the addresses used in the
following discussion are only for example and are not intended to limit the invention in any manner.

[0072] DSP 1 can access both target ports T1, T2 of data memory 22 by executing two aligned load or store
instructions in parallel, as discussed above. For example, a double word 700 at address 700h and a double word 708 at
address 708h can be accessed by two load double word (LDDW) instructions executed in parallel using .D1 and .D2
and target ports T1 and T2. Likewise, word 780 and half word 786 can be accessed by executing a load word (LDW)
instruction and a load half word (LDH) instruction in parallel using .D1 and .D2 and target ports T1 and T2.

[0073] Advantageously, this embodiment of the present invention utilizes the two target ports and two address buses
DA1, DA2 to perform a non-aligned access. For example, double word 721 at address 721h is non-aligned by one
byte. Double word 74Da-74Db at address 74Dh is located in two different rows of the memory. Single word 7B7 located
at address 7B7h is non-aligned by three bytes. Advantageously, each non-aligned access is performed in the same
amount of time as each aligned access, unless the data word is not present in memory 22 and must be retrieved from
secondary memory storage, such as off-chip memory 70 of Figure 1.

[0074] Uniform access time is important for software programs that operate in real time, such as are commonly
executed on DSPs. The problem for real time comes when a loop walks a data structure by a stride related to the
cache/SRAM line size. If the structure starts at an offset such that the unaligned access doesn't require access outside
of the single line, the loop runs quickly since every access runs without the stall. If the starting offset is such that the
nonaligned load crosses the line boundary, there is a stall on every access. The same loop might run twice as long
this time. If a real-time system is designed for the longer loop time, then twice as much performance is being sacrificed
most of the time.

[0075] Figure 8 is a block diagram illustrating D-unit address buses of DSP 1 in more detail and illustrating two target
ports T1, T2 of DSP memory 22. An aspect of the embodiment of the present invention is that load/store unit .D1 can
generate an address for a non-aligned access and provide it on address bus DA1 via address signals 800 and
multiplexer 200a, and simultaneously generate a contiguous address that is greater by the data size and provide it on
address bus DA2 via address signals 801 and multiplexer 200b, as depicted in Table 11. In this embodiment of the
invention, load/store unit .D2 can also generate an address for a non-aligned access and simultaneously generate a
contiguous address incremented by the data size and provide them to address buses DA1 and DA2 via address signal
lines 810 and 811 and multiplexors 200a and 200b, respectively. However, in an alternative embodiment, only one
load/store unit may be so equipped. In yet another embodiment, there may be more than two D units so equipped, for
example.

[0076] In this embodiment of the invention, DSP 1 supports non-aligned memory loads and stores for words and
double words. A non-aligned access can be performed in a single cycle because both target ports T1, T2 are
used to load/store part of the data. From the memory designer's perspective, the two memory operations due to a
non-aligned access are indistinguishable from two memory operations resulting from two instructions executed in parallel,
and in both cases the same memory ordering properties apply. The DSP simply requests an aligned access to each
target port T1, T2 and byte strobes accompany data that must be written. Table 11 shows the accesses that are
performed as a result of non-aligned accesses. Alternative embodiments of the present invention may support other data
sizes for non-aligned access. An alternative embodiment of the present invention may provide the addresses in another
form, such as a byte address without being truncated to the nearest word address, for example. Advantageously,
memory 22 bank conflicts do not occur during non-aligned access.



Table 11. Non-Aligned Memory Access

Request Size  Non-Aligned Byte Offset  DA2 Byte Address  DA1 Byte Address
------------  -----------------------  ----------------  ----------------
Word          0x1 to 0x3               0x4               0x0
Word          0x5 to 0x7               0x8               0x4
Word          0x9 to 0xB               0xC               0x8
Word          0xD to 0xF               0x10              0xC
Word          0x11 to 0x13             0x14              0x10
Word          0x15 to 0x17             0x18              0x14
Word          0x19 to 0x1B             0x1C              0x18
Word          0x1D to 0x1F             0x20              0x1C
Doubleword    0x1 to 0x7               0x8               0x0
Doubleword    0x9 to 0xF               0x10              0x8
Doubleword    0x11 to 0x17             0x18              0x10
Doubleword    0x19 to 0x1F             0x20              0x18
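The DA1 and DA2 entries of Table 11 follow a simple rule, sketched below for linear addressing. This is an illustrative reading of the table, not the address generation circuitry; the names SIZES and port_addresses are hypothetical.

```python
SIZES = {"word": 4, "doubleword": 8}   # request size in bytes


def port_addresses(byte_addr, size):
    """Return (DA1, DA2) byte addresses for a non-aligned request.

    DA1 is the request address truncated to the request size; DA2 is the
    next contiguous aligned address, greater by the data size.
    """
    width = SIZES[size]
    da1 = byte_addr & ~(width - 1)
    da2 = da1 + width
    return da1, da2


for addr, size in ((0x3, "word"), (0x15, "word"), (0x1F, "doubleword")):
    da1, da2 = port_addresses(addr, size)
    print(f"{size:10s} offset {addr:#04x}: DA1={da1:#x} DA2={da2:#x}")
```

When circular addressing is enabled, the second address would additionally be bounded to the circular buffer region; that case is not modeled here.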



[0077] Figure 9 is a [...] of the two target ports [...] circuitry to extract a non-aligned data item according to an
aspect of the present invention. [...] data signals 901 and to load data signals 902 that are connected respectively to
load data buses LD1a,b and LD2a,b. [...] eight memory banks 940-947 that each store sixteen bits of
data, so that two sets of 64-bit data can be selected and provided on load data signals 901, 902. Address ports 921
and 922 each receive an address from address buses DA1 and DA2, respectively, and provide a portion of the address
to separate inputs on address multiplexers 950-957 that provide addresses to the memory banks. Decode circuitry
[...] intended for memory 22. Decode signals 932 are formed by decoder 930 and sent to address multiplexors 950-957
to select which address is provided to each memory bank.

[0078] [...] from instruction decode circuitry 10c of DSP [...] and four [...] DA1, DA2, decode circuitry 930 forms
byte selection signals 933 that are sent to byte selection circuitry [...] places the requested byte, half word, word or
double word on the appropriate set of load data signals 901, 902 in a right-aligned manner in response to byte
selection signals 933.

[0079] When a non-aligned load request is being executed, byte selection circuitry 910 places the selected word or
double word on the appropriate set of load data signals 901 or 902 in response to byte selection signals 933. For
[...] corresponding to byte addresses [...] bus DA2 and five bytes are selected corresponding to byte addresses
750h-754h. Note that the address provided on DA2 is a value of 8h greater than the aligned address on DA1,
corresponding to the eight byte size of the requested non-aligned data item. These eight bytes are then right aligned
and provided on load data signals 901 if register file A 20a is the specified destination of the transfer or on load
signals 902 if register file 20b is the specified destination of the transfer. In this embodiment, the load data bus LDx
that is not associated with the specified destination register file remains free so that an associated .S unit can use the
shared register file write port.

[0080] Figure 10 is a block diagram illustrating load byte selection circuitry 910, also referred to as extraction circuitry,
of Figure 9 in more detail. For simplicity, only byte select multiplexors 1000-1007 connected to load data byte lanes
901(0)-901(7) are shown. Another similar set of multiplexors is connected to load data signals 902. Selected
ones of byte selection signals 933 are connected to each multiplexor to select the appropriate one of sixteen
bytes provided by the memory bank array.

[0081] Figure 11 is a block diagram illustrating the store byte selection circuitry of the memory system of Figure 8 in
more detail. For purposes of this document, the store byte selection circuitry is also referred to as insertion circuitry
for storing a non-aligned data item into the memory subsystem. Pipe 1 store data signals 1121 provide store data from
store data buses ST1a,b to byte selection multiplexors 1100-1115. Likewise, Pipe 2 store data signals 1122 provide
store data from store data buses ST2a,b to byte selection multiplexors 1100-1115. Control signals (not shown) provided
to each byte multiplexor from decode circuitry 930 select the appropriate one of sixteen bytes and present each
selected byte to the respective memory bank 940-947. Write signals byte0-byte15 are asserted as appropriate to cause
a selected byte to be written into the respective memory bank.

[0082] In this embodiment of the present invention, the load byte selection circuitry and the store byte selection
circuitry are required to support the various aligned accesses available via each of the target ports T1, T2.
Advantageously, a single non-aligned access can be supported with only minor changes to the byte selection circuitry.
Advantageously, the memory address decoding circuitry and memory banks do not need any modification and execute
a non-aligned access simply as two aligned accesses in response to the two addresses provided on address buses
DA1 and DA2.

[0083] Figure 12A is a block diagram of a load/store .D unit, which executes the load/store instructions and performs
address calculations. The .D unit receives a base address via first source input src1. An offset value can be selected
from either a second source input src2 or from a field in the instruction opcode, indicated at 1200. An address is provided
on address output 1202 that is in turn connected to at least one of address multiplexors 200a,b. Additionally, an
augmented address is provided on address output 1204 for non-aligned accesses. The augmented address is incremented
by a byte address value of either four or eight as selected by multiplexer 1210 in response to the line size of the
instruction being executed: four is selected for a word instruction and eight is selected for a double word instruction.
Adder 1212 increments an address on signal lines 1213 by the amount selected by multiplexer 1210 to form the
augmented address that is provided on signal lines 1214. This contiguous address is provided on address output 1204 for
a non-aligned access and is connected to the other address multiplexor 200a,b, as discussed previously. A calculated
address value is also provided to the output dst to update a selected base address register value in the register file
when an increment or decrement address mode is selected. According to an aspect of the present invention, the
address on signal lines 1213 and the augmented address on signal lines 1214 are passed through circular buffer
circuitry 1^0 prior to being output on 1202, 1204 so that they can be bounded to remain within a circular buffer region.
[0084] In this embodiment, load and store instructions operate on data sizes from 8 bits to 64 bits. Addressing
modes supported by the .D unit are basic addressing, offset addressing, scaled addressing, auto-increment/auto-decrement,
long-immediate addressing, and circular addressing, as defined by mode field 500. In basic addressing mode,
the content of a selected base register is used as a memory address. In offset addressing mode, the memory address
is determined by two values, a base value and an offset that is either added to or subtracted from the base. Referring
again to Figure 3A and Figure 3B, the base value always comes from a base register specified by a field 514 "base R"
that is any of the registers in the associated register file 20a or 20b, whereas the offset value may come from either
a register specified by an "offset R" field 516 or a 5-bit unsigned constant UCST5 contained in field 516 of the instruction
via signals 1200. Certain load/store instructions have a long immediate address mode that uses a 15-bit unsigned
constant contained in the instruction (not shown in Figure 3). A selected offset is provided on signal lines 1218 to shifter
1220. Scaled addressing mode functions the same as offset addressing mode, except that the offset is interpreted as
an index into a table of bytes, half-words, words or double-words, as indicated by the data size of the load or store
operation, and the offset is shifted accordingly by shifter 1220 in response to control signals 1226 which are derived
by decoding opcode field 510, 512 of the LD/ST instructions.

[0085] In this embodiment of the present invention, an SC bit 520 in the load/store non-aligned double word (LDNDW/
STNDW) instructions controls shifter 1220 so that an offset can be used directly, referred to as unscaled, or shifted by
an amount corresponding to the type of instruction, referred to as scaled. Scaled/unscaled control signal 1224 is derived
by decoding the SC field 520 of LDNDW/STNDW instructions. If SC field 520 is a logical 0, then the offset is not scaled
and signal 1224 is deasserted. If SC field 520 is a logical 1, then the offset is scaled and signal 1224 is asserted. In
this embodiment, for instructions other than LDNDW/STNDW, signal 1224 is asserted so that scaling will be performed
according to data size control signals 1226.
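
By way of illustration only, the offset scaling described above can be sketched in Python. The function names, the table SHIFT_FOR_SIZE and the 32-bit address wrap are assumptions made for this sketch; they model the behavior attributed to shifter 1220 and are not part of the described circuitry.

```python
# Hypothetical software model of the offset shifter; all names are illustrative.
SHIFT_FOR_SIZE = {1: 0, 2: 1, 4: 2, 8: 3}  # access size in bytes -> left shift


def scale_offset(offset, size_bytes, sc=1):
    """Return the byte offset after optional scaling.

    sc=1 scales the offset by the access size. sc=0 (offered only by
    LDNDW/STNDW in the described embodiment) uses the offset as bytes.
    """
    if sc == 0:
        return offset
    return offset << SHIFT_FOR_SIZE[size_bytes]


def effective_address(base, offset, size_bytes, sc=1, subtract=False):
    """Combine base and (scaled) offset, wrapping to an assumed 32-bit address."""
    delta = scale_offset(offset, size_bytes, sc)
    result = base - delta if subtract else base + delta
    return result & 0xFFFFFFFF
```

With these assumptions, an offset of 16 on a double word access yields 128 bytes when scaled, while an unscaled offset of 14 stays at 14 bytes.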

[0086] In auto-increment/decrement addressing mode, the base register is incremented/decremented after the execution
of the load/store instruction by inc/dec unit 1222. There are two sub-modes: pre-increment/decrement, where
the new value in the base register is used as the load/store address, and post-increment/decrement, where the original
value in the register is used as the load/store address. In long-immediate addressing mode, a 15-bit unsigned constant
is added to a base register to determine the memory address. In circular addressing mode, the base register along
with a block size define a region in memory. To access a memory location in that region, a new index value is generated
from the original index modulo the block size in circular addressing unit 1230.
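
The difference between the two sub-modes can be sketched as follows. This is a minimal model under stated assumptions: the function name auto_modify, the use of a negative delta for decrement, and the 32-bit wrap are illustrative choices, not taken from the embodiment.

```python
def auto_modify(base, delta, pre):
    """Return (access_address, new_base) for auto-increment/decrement.

    delta is a byte amount (negative for decrement). With pre=True the
    updated base is used as the access address; with pre=False the
    original base is used and the update is only written back.
    """
    new_base = (base + delta) & 0xFFFFFFFF  # assumed 32-bit base register
    address = new_base if pre else base
    return address, new_base
```

In both sub-modes the base register ends with the same value; only the address used for the access differs.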

[0087] In this embodiment of the invention, a Boolean unit 1240 is provided and can be used for execution of logical 
instructions when the .D unit is not being used to generate an address. 

[0088] Figure 12B is a more detailed block diagram of circular buffer circuitry 1230 of Figure 12A. As explained
earlier, circular mode addressing operates as follows with LD/ST instructions: after shifting offsetR/cst to the left by 3,
2, 1, or 0 for LDDW, LDW, LDH, or LDB respectively, it is then added to or subtracted from baseR to produce the final
address. This add or subtract is performed by only allowing bits N through 0 of the result to be updated, leaving bits
31 through N+1 unchanged after address arithmetic. The resulting address is bounded to a 2^(N+1) range, regardless
of the size of the offsetR/cst. Bounding can be performed in a number of ways, such as by interrupting a carry bit at
the appropriate place of adder 1222. However, in order to support non-aligned accesses, both the address and the
augmented address must be bounded separately.

[0089] In the present embodiment, bounding circuitry 1250 bounds the address provided on signal lines 1213, while
bounding circuitry 1260 bounds the augmented address provided on signal lines 1214. Mask generation circuit 1232
forms a right extended mask (R-mask) in response to a selected block size from the AMR register, as described earlier,
and provides it on bus 1234. A right extended mask has a "1" in every bit position corresponding to an address bit
within the bounds of the 2^(N+1) range, and a "0" in every more significant address bit beyond this range.
[0090] The R-mask is bit-wise ANDed with the address on bus 1213 in AND block 1252 to form a least significant
portion of the address bounded within the 2^(N+1) range. An inverted R-mask is bit-wise ANDed with the original base
address on bus 1216 in AND block 1254 to form a most significant portion of the address above the 2^(N+1) range.
The most significant address portion and the bounded least significant address portion are bit-wise combined in OR
block 1256 to form the final address that is output on bus 1215. The augmented address on bus 1214 is likewise
bounded using AND blocks 1262, 1264 and OR block 1266 and then output on bus 1217.
[0091] Advantageously, by having two bounding circuits 1250, 1260, both addresses are formed in a parallel manner
so that a non-aligned access to a circular buffer region is performed in the same amount of time as an aligned access
to a circular buffer region.
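
The mask-based bounding described above may be easier to follow as a short Python sketch. The function names and the 32-bit address width are assumptions for illustration; the comments tie each operation to the AND and OR blocks named in the text, but the code is a software model rather than the circuit itself.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit address width


def r_mask(n):
    """Right extended mask: ones in bits N..0, zeros in bits 31..N+1."""
    return (1 << (n + 1)) - 1


def bound(base, address, n):
    """Keep bits 31..N+1 of the base and bits N..0 of the address."""
    mask = r_mask(n)
    low = address & mask            # AND block 1252 (or 1262)
    high = base & ~mask & MASK32    # AND block 1254 (or 1264)
    return high | low               # OR block 1256 (or 1266)


def circular_pair(base, offset_bytes, line_size, n):
    """Form and bound both the address and the augmented address."""
    address = (base + offset_bytes) & MASK32
    augmented = (address + line_size) & MASK32
    return bound(base, address, n), bound(base, augmented, n)
```

For example, with N = 3 (a 16-byte region), a base of 0x1008 and a byte offset of 12, the address wraps to 0x1004 and the augmented address for a double word access wraps to 0x100C. The two results are computed independently, which mirrors the parallel bounding circuits.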

[0092] Figure 12C is a flow chart illustrating formation of scaled and non-scaled addresses for accessing a linear
region or a circular buffer region with either aligned or non-aligned accesses, according to an aspect of the present
invention. In step 600, a circular buffer region is set up in memory subsystem 22 by initializing the AMR register and
an associated base register, as discussed above.

[0093] In step 602, an instruction is fetched for execution. In this embodiment of the present invention, instructions
are fetched in fetch packets of eight instructions simultaneously during instruction execution pipeline phases PG, PS,
PW and PR. Other embodiments of the present invention may fetch instructions singly or doubly, for example, in a
different number of phases.

[0094] In step 610, the instruction is decoded to form a plurality of fields. In this embodiment, decoding is performed
in two phases of the instruction execution pipeline, but in other embodiments of the present invention decoding may
be performed in one or three or more phases.

[0095] In step 620, a base-offset address for accessing a data item for the instruction is formed by combining in 627
a base address value and an offset value, such that the offset value is selectively scaled or not scaled. Step 627 may
include post or pre-incrementing or decrementing, for example, as indicated by mode field 500. In 621 or 622, for a
non-aligned double word load or store instruction (LD/STNDW) the offset value is scaled by shifting left three bits only
if the SC field 520 has a value of 1. If SC field 520 has a value of 0, then the offset value is not scaled and is therefore
treated as a byte offset. If the instruction is a load or store double, then the offset is scaled by left shifting three bits in
step 623 to form a double word offset. If the instruction is a LD/ST word, then the offset is scaled by shifting left two
bits in step 624 to form a word offset. If the instruction is a half word LD/ST instruction, then the offset is scaled by
shifting left one bit in step 625 to form a half word offset. If the instruction is a byte LD/ST instruction, then the offset
is scaled by shifting zero bits in step 626 to form a byte offset. In the present embodiment, the scaling amount is
determined by opcode field 510, 512 that specifies the type of LD/ST instruction. In another embodiment, there may
be a field to specify operand size, for example. In the present embodiment, step 620 is performed during the E1 pipeline
phase.

[0096] In step 630, if the instruction fetched in step 602 is an aligned type instruction, then the base-offset address
from step 627 is concatenated to stay within the boundary of the circular buffer region specified in step 602, if circular
addressing is specified by the AMR for the base register selected by the instruction. In step 632, the resultant address
is sent to the memory subsystem during pipeline phase E2. If circular addressing is not selected, then the base-offset
address from step 627 is used to access memory during pipeline phase E2.

[0097] In step 640, if the instruction fetched in step 602 is a non-aligned type instruction, then a line size is added
to the base-offset address from step 627 to form an augmented address. The line size is determined by the instruction
type decoded in step 610. For a double word instruction type, the line size is eight bytes. For a word instruction type,
the line size is four bytes.

[0098] In step 650, the base-offset address from step 627 is concatenated to stay within the boundary of the circular
buffer region specified in step 602, if circular addressing is specified by the AMR for the base register selected by the
instruction. In step 652, the resultant address is sent to the first port of the memory subsystem during pipeline phase
E2. If circular addressing is not selected, then the base-offset address from step 627 is used to access memory during
pipeline phase E2. Likewise, in steps 651 and 653, the augmented address is selectively bounded if circular addressing
is selected and a second address is sent to the second port of the memory subsystem during pipeline phase E2.
[0099] During step 654, the requested non-aligned data item is extracted from the two aligned data items accessed
in steps 652, 653.
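
The extraction of a non-aligned item from two aligned accesses can be modelled in a few lines of Python. This is a sketch under stated assumptions: memory is represented as a bytes object, the function name load_non_aligned is hypothetical, and each port is assumed to return the aligned line that contains the address presented to it.

```python
def load_non_aligned(memory, address, line_size):
    """Return line_size bytes starting at any byte address.

    Two aligned lines are read: the one containing the address and the one
    containing the augmented address (address + line_size). The requested
    bytes are then selected from the concatenation of the two lines.
    """
    align = ~(line_size - 1)
    first = address & align                  # aligned line for the first port
    second = (address + line_size) & align   # aligned line for the second port
    line_a = memory[first:first + line_size]
    line_b = memory[second:second + line_size]
    shift = address - first                  # byte position within the first line
    return (line_a + line_b)[shift:shift + line_size]
```

When the address happens to be aligned, the shift is zero and the result comes entirely from the first line, so the same path serves both cases.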

[0100] An assembler which supports this embodiment of the invention defaults increments and decrements to 1 and
offsets to 0 if an offset register or constant is not specified. Loads that do not modify the baseR can use the assembler
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 3 for double word loads. Parentheses,
( ), are used to tell the assembler that the offset is a non-scaled offset. For example, LDNDW (.unit) *+baseR (14),
dst represents an offset of 14 bytes and the assembler writes out the instruction with offsetC = 14 and sc = 0. Likewise,
LDNDW (.unit) *+baseR [16], dst represents an offset of 16 double words, or 128 bytes, and the assembler writes out
the instruction with offsetC = 16 and sc = 1.

[0101] In this embodiment, LD/STDW instructions do not include an SC field, but LDNDW and STNDW instructions
do include an SC field. However, parentheses, ( ), are still used to tell the assembler that the offset is a non-scaled,
constant offset. The assembler right shifts the constant by 3 bits for double word stores before using it for the ucst5 field.
After scaling by the STDW instruction, this results in the same constant offset as the assembler source if the least
significant three bits are zeros. For example, STDW (.unit) src, *+baseR (16) represents an offset of 16 bytes (2 double
words), and the assembler writes out the instruction with ucst5 = 2. STDW (.unit) src, *+baseR [16] represents an offset
of 16 double words, or 128 bytes, and the assembler writes out the instruction with ucst5 = 16.
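
The assembler behavior described in the two preceding paragraphs can be summarized in a short Python sketch. The function name, its parameters and the choice to raise an error for constants that cannot be encoded are assumptions for illustration; the text does not state how an actual assembler reports such cases.

```python
def encode_dw_offset(value, bracketed, has_sc_field):
    """Return (offset_field, sc) for a double word load/store constant offset.

    bracketed=True models [value] (scaled); False models (value) (bytes).
    has_sc_field=True models LDNDW/STNDW; False models LDDW/STDW, for which
    sc is reported as None because the instruction has no SC field.
    """
    if has_sc_field:
        field, sc = value, (1 if bracketed else 0)
    elif bracketed:
        field, sc = value, None
    else:
        # Hypothetical check: a byte offset is only representable here if
        # its three least significant bits are zero.
        if value & 0x7:
            raise ValueError("byte offset is not a multiple of 8")
        field, sc = value >> 3, None
    if not 0 <= field < 32:  # 5-bit unsigned constant
        raise ValueError("offset does not fit in a 5-bit field")
    return field, sc
```

Applied to the examples in the text, LDNDW with (14) gives a field of 14 with sc = 0, LDNDW with [16] gives 16 with sc = 1, STDW with (16) gives 2, and STDW with [16] gives 16.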
[0102] Referring again to step 620 of Figure 6, the SC bit (scaled or not scaled) affects pre/post incrementing. If a pre
or post increment/decrement is specified, then the increment/decrement amount is controlled by the SC bit. In non-scaled
mode, the increment/decrement corresponds to a number of bytes. In assembly code, this would be written as
shown in Table 12, examples 1 and 2. In both of these cases, reg1 ends up with the value "reg1 + reg2".
[0103] In scaled mode, the increment/decrement corresponds to a number of double-words. The assembly syntax
for this is shown in Table 12, examples 3 and 4. In both of these cases, reg1 ends up with the value "reg1 + 8*reg2".
That is, reg2 is "scaled" by the size of the access.

[0104] These comments apply to the integer offset modes as well, as illustrated in Table 12, examples 5-8.
Likewise, similar examples apply to the pre/post decrement instructions.



Table 12. Examples of Instructions With Various Pre/Post Increment, Scaled and Non-Scaled Addressing Modes

example   instruction syntax              operation
1         LDNDW  *++reg1(reg2), reg3      pre-increment, non-scaled
2         LDNDW  *reg1++(reg2), reg3      post-increment, non-scaled
3         LDNDW  *++reg1[reg2], reg3      pre-increment, scaled
4         LDNDW  *reg1++[reg2], reg3      post-increment, scaled
5         LDNDW  *++reg1(cst5), reg2      pre-increment, non-scaled
6         LDNDW  *reg1++(cst5), reg2      post-increment, non-scaled
7         LDNDW  *++reg1[cst5], reg2      pre-increment, scaled
8         LDNDW  *reg1++[cst5], reg2      post-increment, scaled



[0105] An advantage of scaled vs. non-scaled for the integer offset modes is that scaled provides a larger range of
access whereas non-scaled provides finer granularity of access. Typically, when large offsets are used, they are already
multiples of the access size. When small offsets are used, they typically are not, since a short moving distance
is usually desired.

[0106] Scaled vs. non-scaled in register-offset modes is advantageous as well, but for different reasons. In scaled
mode, the register offset usually corresponds to an array index of some sort. In non-scaled mode, the register offset
may correspond to an image width or other stride parameter that is not a multiple of the access width, for instance,
when accessing a 2 dimensional array whose row width is not a multiple of 8.
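
A small numeric illustration of the stride case follows. The row width of 13 bytes, the base address and the helper names are hypothetical values chosen for this sketch only.

```python
def row_addresses(base, stride_bytes, rows):
    """Byte address of the first element of each row of a 2 dimensional array."""
    return [base + r * stride_bytes for r in range(rows)]


def is_aligned(address, size_bytes=8):
    """True if the address falls on a size_bytes boundary."""
    return address % size_bytes == 0


# A hypothetical image whose rows are 13 bytes wide, starting at 0x2000.
starts = row_addresses(0x2000, 13, 4)
alignment = [is_aligned(a) for a in starts]
```

Here only the first row starts on a double word boundary, so stepping from row to row with a non-scaled register offset of 13 relies on non-aligned double word accesses.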

[0107] Figure 13 is a block diagram of an alternative embodiment of a digital system 1300 with processor core 1301
similar to CPU 10 of Figure 1. A direct mapped program cache 1710, having 16 Kbytes capacity in memory 1710b, is
controlled by L1 Program (L1P) controller 1710a and connected thereby to the instruction fetch stage 10a. A 2-way
set associative data cache 1720, having a 16 Kbyte capacity in memory 1720b, is controlled by L1 Data (L1D) controller
1720a and connected thereby to data units D1 and D2. An L2 memory 1730 having four banks of memory, 128 Kbytes
total, is connected to L1P 1710a and to L1D 1720a to provide storage for data and programs. External memory interface
(EMIF) 1750 provides a 64-bit data path to external memory, not shown, which provides memory data to L2 memory
1730 via extended direct memory access (DMA) controller 1740.

[0108] EMIF 1752 provides a 16-bit interface for access to external peripherals, not shown. Expansion bus 1770
provides host and I/O support similarly to host port 60/80 of Figure 1.

[0109] Three multi-channel buffered serial ports (McBSP) 1760, 1762, 1764 are connected to DMA controller 1740.
A detailed description of a McBSP is provided in U.S. Patent S.N. 09/055,011 (TI-2B204, Seshan, et al).
[0110] Advantageously, non-aligned accesses to a data cache 1720 are performed in the same amount of time as an
aligned access to data cache 1720, as long as a miss does not occur. Likewise, advantageously, non-aligned accesses
to a circular buffer region in data cache 1720 are performed in the same amount of time as an aligned access to a circular
buffer region in data cache 1720, as long as a miss does not occur.

[0111] Figures 14A and 14B together are a block diagram of data cache L1D 1720b of Figure 13. L1D is a 16K byte
2-way associative cache that has 128 sets and a line size of 32 bytes. The data bus interface from L2 to L1D is 32
bytes wide (32 total, not 32 for A side and 32 for B side) - it takes one clock cycle to send a line from L2 to L1D. The
store path from L1D to L2 is 16 bytes wide (16 total, not 16 for A side and 16 for B side), resulting in a 4 clock cycle
line eviction. L1D operates solely as a cache and is not memory-mapped. Table 13 summarizes the features of cache
L1D.



Table 13. L1D Cache Features

Size                            16K bytes
Associativity                   2-way
Line Size                       32 bytes
L2 To L1D Load Bus Width        32 bytes
L1D To L2 Store Bus Width       16 bytes
Accesses Per Clock Cycle        2
Cycles For L1D Fill             1
Cycles For L1D Line Eviction    4
Interleave/Bank Size            16 bit
Number Of Interleaves/Banks     16
Tag Array SRAM Size             4-128x19
Data Array SRAM Size            16-512x16
Write Hit Policy                Write back w/Write-through tags
Write Miss Policy               no allocate
Replacement Strategy            LRU



[0112] L1D must carry four status bits for each cache line, including a valid bit, a modified bit, an NW (no-write) bit,
and an LRU bit. The valid bits, modified (dirty) bits, and NW bits may reside in the tag RAM. The LRU bits reside in registers
for single-cycle read-modify-write operation.

[0113] Read misses in L1D cause the corresponding line to be brought into the cache. The line originates from L2
if it is contained there; otherwise, the I/O subsystem transfers the line to L2 from another memory-mapped location,
and the line is forwarded to L1D. L1D must keep track of the modification state of each of its lines. If an L1D line has
been modified, it must be written out to L2 when it is replaced (or when requested to do so by an L2 control register write).
[0114] L1D uses an LRU replacement strategy. A given line can reside at the same address offset within either way
0 or way 1 of the cache. Each line has an LRU bit that keeps track of which way was most recently used for the
corresponding line. When a new line must replace a line that is already stored in the cache, it replaces the line that
was least recently used and modifies the LRU status bit accordingly. If one of the two candidates for replacement is
invalid, it is replaced without regard to LRU status, and the LRU bit indicates that the way holding the new line is most
recently used.
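
The replacement rule for one set of a 2-way cache can be sketched as follows. The class TwoWaySet, its fields and the initial value of the LRU state are assumptions made for this model, which covers allocating (read) accesses only; it is not the controller implementation.

```python
class TwoWaySet:
    """Hypothetical model of one set of a 2-way cache with a single LRU bit."""

    def __init__(self):
        self.tags = [None, None]  # None marks an invalid way
        self.mru = 0              # way most recently used (assumed reset value)

    def access(self, tag):
        """Return True on a hit; on a miss, allocate the line and return False."""
        if tag in self.tags:
            self.mru = self.tags.index(tag)
            return True
        if None in self.tags:
            victim = self.tags.index(None)  # invalid way wins regardless of LRU
        else:
            victim = 1 - self.mru           # otherwise evict the least recently used way
        self.tags[victim] = tag
        self.mru = victim                   # the new line becomes most recently used
        return False
```

For instance, after accesses to lines A, B and then A again, a miss on line C evicts B, because A was used more recently.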

[0115] The L1D controller 1720a must not use an uninitialized value of the LRU bit that is present following power-up/
reset. Invalidate operations do not affect LRU status.
[0116] Read hits allow the DSP to continue execution without stalling. The LRU status might need to be updated.
[0117] Write misses do not cause allocation in L1D; the write data is sent to a write buffer to await transfer to L2.
Provided that the write buffer is not full, the DSP does not need to stall on write misses to L1D. The LRU status is
unaffected by write misses.

[0118] On a write hit, data is written into the cache, and the DSP continues to execute without stalling. If the L1D
line has not yet been modified, L1D must also perform a tag writethrough to L2, sending the tag value of the corresponding
line to L2. L2 stores a copy of the tag for each line in L1D along with a dirty bit for the line, and it must be
informed that the L1D line has been modified. With this mechanism L2 monitors the status of L1D data without the
complexity or conflicts caused by snoop traffic. Tag writethrough should have a single cycle throughput, assuming no
conflicts. The LRU status might need to be updated due to a write hit.
[0119] L1D adheres to the pipeline timing described in Table 14.



Table 14. Pipeline Description

Stage   Module   Action
E1      DSP      Register file read and address generate
E2      DSP      Send address and data (if a store) to L1D
E2      L1D      Receive data and perform tag lookup to initiate an address compare
E3      L1D      If address matches and tag is valid, initiate data RAM read or write. Else send data to write
                 buffer for stores or stall DSP for loads
E4      DSP      Receive load data from L1D
E4      L1D      Send load data to DSP
E5      DSP      Write load data into register file



[0120] Figure 14A is a block diagram illustrating an implementation of the L1D tag array. Logically, 4 tag RAMs
1401-1404 are required for L1D. Both ways of the cache require a tag RAM, and the RAMs are replicated to allow
both D units of the DSP to simultaneously perform tag lookups on different addresses. A first address is provided by
.D1 on a first address bus 1410 and accesses tag RAMs 1401-1402. A second address is provided by .D2 on bus 1411
and accesses tag RAMs 1403-1404. When a line is replaced, the tag RAMs associated with both D units must be
updated to maintain duplicate copies of the tags. A three-input mux 1420, 1421 is provided to allow tag comparisons
to occur one cycle earlier.

[0121] The individual tag RAMs are 128 x 20. One bit of the 20 bits is used to store a physical address attribute map
(PAAM) NW (no-write) bit. Aspects of a PAAM are described in detail in co-assigned U.S. Patent application Number
09/702,477 entitled System Address Properties Control Cache Memory, Including Physical Address Attribute Map
(PAAM). The PAAM PC and NR bits do not need to be stored in L1D.

[0122] Figure 14B is a block diagram illustrating an implementation of the L1D data array 1720b. Note that the L1D
Data diagram omits the logic for a write buffer. L1D contains a buffer of at least 58 bytes that resides between L1D
and L2, and it is used for both victims and L1D write misses. The buffer can hold one 32 byte victim plus two L1D
doubleword write misses, or it could hold up to ten L1D doubleword misses without a victim.
[0123] Each D unit of the DSP can access L1D 1720 with a load or store of a byte, halfword, word, or doubleword.
When the two D units request a simultaneous access, there may or may not be stalls as a result of bank conflicts.
Thirty-two bytes are available per cycle (an individual bank holds data for both ways of the cache). A doubleword
access can occur at one of four offsets within a line, including 0, 8, 16, and 24 bytes. Assuming random behavior, there
is a 25% chance of a conflict between two simultaneous doubleword accesses. Like doubleword requests, byte, halfword,
and word requests may or may not have bank conflicts, but they are less likely to happen and easier to avoid
from a programming standpoint.
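
The 25% figure can be reproduced by enumerating the possible offset pairs. The sketch below assumes sixteen 16-bit banks selected by the low address bits, aligned accesses with uniformly random offsets, and that any two accesses touching a common bank conflict; these modelling choices and the function names are assumptions, not statements about the hardware.

```python
from fractions import Fraction
from itertools import product

BANK_BYTES = 2   # 16-bit banks
NUM_BANKS = 16   # 16 banks give 32 bytes per cycle


def banks_touched(offset, size_bytes):
    """Set of bank indices covered by an access at the given byte offset."""
    return {((offset + i) // BANK_BYTES) % NUM_BANKS for i in range(size_bytes)}


def conflict_probability(size_bytes):
    """Chance that two random aligned accesses of this size share a bank."""
    offsets = range(0, BANK_BYTES * NUM_BANKS, size_bytes)
    pairs = list(product(offsets, repeat=2))
    hits = sum(
        1 for a, b in pairs
        if banks_touched(a, size_bytes) & banks_touched(b, size_bytes)
    )
    return Fraction(hits, len(pairs))
```

Under these assumptions conflict_probability(8) evaluates to 1/4, matching the figure stated for doubleword accesses, and the value for word accesses is smaller.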

[0124] Advantageously, L1D cache 1720b is not affected by a single non-aligned word or double word load or store
transaction since a non-aligned transaction is presented to the cache as two aligned transfers. Byte swapping circuitry
similar to Figure 10 and Figure 11 is included in controller block 1720a to provide alignment.
[0125] Figure 15 illustrates an exemplary implementation of a digital system that includes DSP 1 packaged in an
integrated circuit 40 in a mobile telecommunications device, such as a wireless telephone 15. Wireless telephone 15
has integrated keyboard 12 and display 14. As shown in Figure 15, DSP 1 is connected to the keyboard 12, where
appropriate via a keyboard adapter (not shown), to the display 14, where appropriate via a display adapter (not shown),
and to radio frequency (RF) circuitry 16. The RF circuitry 16 is connected to an aerial 18. Advantageously, by allowing
non-aligned accesses into linear regions or circular buffer regions in the memory subsystem of DSP 1, complex signal
processing algorithms can be written in a more efficient manner to satisfy the demand for enhanced wireless telephony
functionality. More importantly, non-aligned accesses into linear and circular buffer regions take the same amount of
time as aligned accesses into the same regions, so that real time algorithms operate in a consistent, predictable manner.
[0126] Table 15 summarizes instruction operation and execution notations used throughout this document.



Table 15. Instruction Operation and Execution Notations

Symbol            Meaning
long              40-bit register value
+a                Perform twos-complement addition using the addressing mode defined by the AMR
-a                Perform twos-complement subtraction using the addressing mode defined by the AMR
xor               Bitwise exclusive OR
not               Bitwise logical complement
by..z             Selection of bits y through z of bit string b
>>s               Shift right with sign extension
>>z               Shift right with a zero fill
x clear b,e       Clear a field in x, specified by b (beginning bit) and e (ending bit)
x exts l,r        Extract and sign-extend a field in x, specified by l (shift left value) and r (shift right value)
x extu l,r        Extract an unsigned field in x, specified by l (shift left value) and r (shift right value)
+s                Perform twos-complement addition and saturate the result to the result size, if an overflow or
                  underflow occurs
-s                Perform twos-complement subtraction and saturate the result to the result size, if an overflow or
                  underflow occurs
x set b,e         Set field in x to all 1s, specified by b (beginning bit) and e (ending bit)
lmb0(x)           Leftmost 0 bit search of x
lmb1(x)           Leftmost 1 bit search of x
norm(x)           Leftmost nonredundant sign bit of x
abs(x)            Absolute value of x
and               Bitwise AND
bi                Select bit i of source/destination b
bit_count         Count the number of bits that are 1 in a specified byte
bit_reverse       Reverse the order of bits in a 32-bit register
byte0             8-bit value in the least significant byte position in 32-bit register (bits 0-7)
byte1             8-bit value in the next to least significant byte position in 32-bit register (bits 8-15)
byte2             8-bit value in the next to most significant byte position in 32-bit register (bits 16-23)
byte3             8-bit value in the most significant byte position in 32-bit register (bits 24-31)
bv2               Bit Vector of two flags for s2 or u2 data type
bv4               Bit Vector of four flags for s4 or u4 data type
cond              Check for either creg equal to 0 or creg not equal to 0
creg              3-bit field specifying a conditional register
cstn              n-bit constant field (for example, cst5)
dst_h or dst_o    msb32 of dst (placed in odd register of 64-bit register pair)
dst_l or dst_e    lsb32 of dst (placed in even register of a 64-bit register pair)
dws4              Four packed signed 16-bit integers in a 64-bit register pair
dwu4              Four packed unsigned 16-bit integers in a 64-bit register pair
gmpy              Galois Field Multiply
i2                Two packed 16-bit integers in a single 32-bit register
i4                Four packed 8-bit integers in a single 32-bit register
int               32-bit integer value
lsbn or LSBn      n least significant bits (for example, lsb16)
msbn or MSBn      n most significant bits (for example, msb16)
nop               No operation
or                Bitwise OR
R                 Any general-purpose register
rotl              Rotate left
sat               Saturate
sbyte0            Signed 8-bit value in the least significant byte position in 32-bit register (bits 0-7)
sbyte1            Signed 8-bit value in the next to least significant byte position in 32-bit register (bits 8-15)
sbyte2            Signed 8-bit value in the next to most significant byte position in 32-bit register (bits 16-23)
sbyte3            Signed 8-bit value in the most significant byte position in 32-bit register (bits 24-31)
scstn             Signed n-bit constant field (for example, scst7)
se                Sign-extend
sint              Signed 32-bit integer value
slsb16            Signed 16-bit integer value in lower half of 32-bit register
smsb16            Signed 16-bit integer value in upper half of 32-bit register
s2                Two packed signed 16-bit integers in a single 32-bit register
s4                Four packed signed 8-bit integers in a single 32-bit register
sllong            Signed 64-bit integer value
ubyte0            Unsigned 8-bit value in the least significant byte position in 32-bit register (bits 0-7)
ubyte1            Unsigned 8-bit value in the next to least significant byte position in 32-bit register (bits 8-15)
ubyte2            Unsigned 8-bit value in the next to most significant byte position in 32-bit register (bits 16-23)
ubyte3            Unsigned 8-bit value in the most significant byte position in 32-bit register (bits 24-31)
ucstn             n-bit unsigned constant field (for example, ucst5)
uint              Unsigned 32-bit integer value
ullong            Unsigned 64-bit integer value
ulsb16            Unsigned 16-bit integer value in lower half of 32-bit register
umsb16            Unsigned 16-bit integer value in upper half of 32-bit register
u2                Two packed unsigned 16-bit integers in a single 32-bit register
u4                Four packed unsigned 8-bit integers in a single 32-bit register
xi2               Two packed 16-bit integers in a single 32-bit register that can optionally use cross path
xi4               Four packed 8-bit integers in a single 32-bit register that can optionally use cross path
xsint             Signed 32-bit integer value that can optionally use cross path
xs2               Two packed signed 16-bit integers in a single 32-bit register that can optionally use cross path
xs4               Four packed signed 8-bit integers in a single 32-bit register that can optionally use cross path
xuint             Unsigned 32-bit integer value that can optionally use cross path
xu2               Two packed unsigned 16-bit integers in a single 32-bit register that can optionally use cross path
xu4               Four packed unsigned 8-bit integers in a single 32-bit register that can optionally use cross path
->                Assignment
+                 Addition
++                Increment by one
x                 Multiplication
-                 Subtraction
>                 Greater than
<                 Less than
<<                Shift left
>>                Shift right
>=                Greater than or equal to
<=                Less than or equal to
==                Equal to
~                 Logical Inverse
&                 Logical And


[0127] Fabrication of digital systems 1 and 1300 involves multiple steps of implanting various amounts of impurities
into a semiconductor substrate and diffusing the impurities to selected depths within the substrate to form transistor
devices. Masks are formed to control the placement of the impurities. Multiple layers of conductive material and insulative
material are deposited and etched to interconnect the various devices. These steps are performed in a clean
room environment.

[0128] A significant portion of the cost of producing the data processing device involves testing. While in wafer form,
individual devices are biased to an operational state and probe tested for basic operational functionality. The wafer is
then separated into individual dice which may be sold as bare die or packaged. After packaging, finished parts are
biased into an operational state and tested for operational functionality.

[0129] Thus, a digital system is provided with a processor having an improved instruction set architecture. The .D
units can access words and double words on any byte boundary by using non-aligned load and store instructions,
while maintaining the same instruction execution timing for aligned and non-aligned memory accesses. Advantageously,
a significant amount of additional hardware is not required to perform the non-aligned accesses because a single
non-aligned access can be performed using the resources of two separate aligned access ports.
[0130] As used herein, the terms "applied," "connected," and "connection" mean electrically connected, including 
where additional elements may be in the electrical connection path. "Associated" means a controlling relationship, 
such as a memory resource that is controlled by an associated port. The terms assert, assertion, de-assert, 
de-assertion, negate and negation are used to avoid confusion when dealing with a mixture of active high and active low signals. 
Assert and assertion are used to indicate that a signal is rendered active, or logically true. De-assert, de-assertion, 
negate, and negation are used to indicate that a signal is rendered inactive, or logically false. 
[0131] While the invention has been described with reference to illustrative embodiments, this description is not 
intended to be construed in a limiting sense. Various other embodiments of the invention will be apparent to persons 
skilled in the art upon reference to this description. For example, more than two target memory ports may be provided. 
Different data widths may be provided, such as 128-bit data items, for example. As long as the size of a non-aligned 
data item is less than or equal to the size of each aligned access port, then two access ports can be shared to provide 
a single non-aligned access without adding significant additional resources. 

[0132] Scaling circuitry may be included or not included within the address generation circuitry. Scaling/non-scaling 
can be selectively included in instructions for data sizes other than double words. 

[0133] Circular buffer address circuitry may be included or not included within the address generation circuitry. 
[0134] In the digital system according to the claims, the first instruction type is a non-aligned access type, and 
the second instruction type is an aligned access type. 

[0135] In the digital system according to the claims, the second load/store unit is operable to execute a non-memory 
access instruction in parallel with the first load/store unit accessing the memory subsystem for an instruction of the 
first type. 

[0136] In the digital system according to the claims, the memory subsystem is a cache memory. 
[0137] In the digital system according to the claims, the microprocessor is a digital signal processor. 
[0138] In the digital system according to the claims, the address circuitry comprises: combination circuitry connected 
to receive a base address value and an offset value, operable to combine the base address value and the offset value 
to form a base-offset address, wherein the base-offset address is selectively coupled to the first address output; and 
adder circuitry connected to receive the base-offset address and a line size value, operable to add the line size value 
to the base-offset address to form an augmented address, wherein the augmented address is selectively coupled to the 
second address output. 
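
By way of example only, the two address outputs described above can be expressed as a short software sketch. It assumes that the combination of base and offset is an addition and that the line size is eight bytes; neither assumption is taken from this paragraph, and the names `LINE_SIZE` and `generate_addresses` are hypothetical.

```python
from typing import Optional, Tuple

LINE_SIZE = 8  # assumed line size value, in bytes


def generate_addresses(base: int, offset: int,
                       non_aligned: bool) -> Tuple[int, Optional[int]]:
    """Return the (first, second) address outputs for one access."""
    # Combination circuitry: modelled here as a plain addition (an assumption).
    base_offset = base + offset
    if not non_aligned:
        # Aligned access type: only the first address output is driven.
        return base_offset, None
    # Adder circuitry: the augmented address selects the next aligned item.
    augmented = base_offset + LINE_SIZE
    return base_offset, augmented


first, second = generate_addresses(0x100, 5, non_aligned=True)
```

The `non_aligned` flag stands in for the selective coupling of the augmented address to the second address output.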

[0139] In the digital system according to the claims, the step of extracting loads a data value from the non-aligned 
data item into the microprocessor. 
[0140] In the digital system according to the claims, the step of extracting stores a data value in the non-aligned data 
item in the memory subsystem by storing a first portion of the non-aligned data item in a first aligned data item and 
storing a second portion of the non-aligned data item in a second aligned data item. 
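
The splitting of a non-aligned store into two portions can likewise be sketched in software. This is an illustrative model under the assumption of eight-byte aligned data items; `LINE_SIZE` and `non_aligned_store` are hypothetical names, and the two writes that hardware would issue through separate ports are shown sequentially.

```python
LINE_SIZE = 8  # assumed width, in bytes, of each aligned data item


def non_aligned_store(memory: bytearray, address: int, data: bytes) -> None:
    """Store data at any byte address as two portions in adjacent aligned items."""
    if len(data) > LINE_SIZE:
        raise ValueError("item must not exceed the width of one aligned item")
    if address < 0 or address + len(data) > len(memory):
        raise IndexError("store falls outside the modelled memory")
    offset = address % LINE_SIZE
    first_base = address - offset
    second_base = first_base + LINE_SIZE
    # Number of bytes that still fit in the first aligned item.
    split = min(len(data), LINE_SIZE - offset)
    # First portion goes into the first aligned data item.
    memory[first_base + offset:first_base + offset + split] = data[:split]
    # Second portion (possibly empty) goes into the second aligned data item.
    memory[second_base:second_base + len(data) - split] = data[split:]


memory = bytearray(16)
non_aligned_store(memory, 6, b"\xaa\xbb\xcc\xdd")
```

Bytes outside the stored item are left untouched in both aligned items, which corresponds to writing only the selected byte positions of each item.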

[0141] In the digital system according to the claims, the first data item is an aligned data item and the second 
data item is an aligned data item. 

[0142] A digital system embodying the present invention comprises: a microprocessor having at least a first 
load/store unit; a memory subsystem having at least a first memory port connected to the first load/store unit; address 
generation circuitry in the first load/store unit having a first address output connected to the first memory port, the address 
generation circuitry operable to provide a first byte address on the first address output; and an extraction circuit 
connected to the first memory port, wherein the extraction circuit is operable to provide a first non-aligned multi-byte data 
item to the first load/store unit responsive to the first byte address. 



Claims 

1. A digital system, comprising: 

a microprocessor having at least a first load/store unit and a second load/store unit; 

a memory subsystem having at least a first memory port connected to the first load/store unit and a second 
memory port connected to the second load/store unit; 

address generation circuitry in the first load/store unit having a first address output connected to the first 
memory port and a second address output selectively connected to the second memory port, the address 
generation circuitry operable to provide a first address on the first address output and a second address on 
the second address output; and 

an extraction circuit connected to the first memory port, wherein the extraction circuit is operable to provide a 
first non-aligned data item to the first load/store unit extracted from a first data item accessed in response to 
the first address and from a second data item accessed in response to the second address. 

2. The digital system of Claim 1, further comprising insertion circuitry connected to the first memory port, wherein 
the insertion circuitry is operable to receive a second non-aligned data item from the first load/store unit and to 
store a first portion of the second non-aligned data item in a first data item in the memory subsystem responsive 
to the first address and to store a second portion of the second non-aligned data item in a second data item in the 
memory subsystem responsive to the second address. 

3. The digital system according to any preceding Claim, wherein the address generation circuitry is operable to provide 
the first address and second address to the memory subsystem to access the memory subsystem in response to 
a first instruction type and to provide only the first address to the memory subsystem to access the memory 
subsystem in response to a second instruction type. 

4. The digital system according to any preceding Claim, wherein the second load/store unit comprises address gen- 
eration circuitry with a first address output selectively connected to the second memory port, such that the second 
load/store unit is operable to transfer a data item to the second memory port in parallel with the first load/store unit 
transferring a data item to the first memory port. 

5. The digital system according to any preceding Claim, wherein the address generation circuitry of the second 
load/store unit is operable to provide the first address on a second address output selectively connected to the first 
memory port and the second address on the first address output for accessing the memory subsystem for an 
instruction of the first type. 

6. The digital system according to any preceding Claim being a cellular telephone, further comprising: 

an integrated keyboard connected to the CPU via a keyboard adapter; 
a display connected to the CPU via a display adapter; 
radio frequency (RF) circuitry connected to the CPU; and 
an aerial connected to the RF circuitry. 

7. The digital system according to any preceding Claim, wherein the memory subsystem comprises: 
a plurality of memory banks connected to the extraction circuitry; 
decode circuitry connected to the first memory port and to the second memory port; and 
a plurality of address multiplexers connected respectively to the plurality of memory banks with a first input of 
each of the plurality of multiplexers connected to receive an address from the first memory port and a second 
input connected to receive an address from the second memory port, each of the plurality of address 
multiplexers having a select control separately connected to the decode circuitry, such that the decode circuitry is 
operable to individually control each of the plurality of address multiplexers. 

8. A method of operating a microprocessor, comprising the steps of: 

fetching a first instruction for execution, wherein the first instruction is a non-aligned access type instruction 
and wherein the first instruction references a non-aligned data item in a memory subsystem region; 
decoding the instruction to form a plurality of fields; 

forming a first address and accessing a first data item via a first port of the memory subsystem; 
forming a second address and accessing a second data item via a second port of the memory subsystem, 
such that the first address and second address are formed in a simultaneous manner and such that the first 
data item and the second data item are accessed in a simultaneous manner; and 
extracting the non-aligned data item from the first data item and the second data item. 

9. The method of Claim 8, wherein the step of forming a first address comprises the step of combining a base address 
value and an offset value in accordance with one of the plurality of fields of the instruction. 

10. The method according to any of Claims 8-9, wherein the step of forming a second address comprises the step of 
adding a line size value in accordance with another one of the fields of the instruction. 

11. The method according to any of Claims 8-10, further comprising the steps of: 

fetching a second instruction and a third instruction for parallel execution, wherein the second instruction and 
the third instruction are both aligned access type instructions; 

forming a third address in accordance with the second instruction and accessing a third data item via the first 
port of the memory subsystem; 

forming a fourth address in accordance with the third instruction and accessing a fourth data item via the 
second port of the memory subsystem, such that the third address and the fourth address are formed in a 
simultaneous manner and such that the third data item and the fourth data item are accessed in a simultaneous 
manner. 